[...] unit and an adder circuit to execute operations over vectors of programmable word length data. Increasing of
[...] the ability to process a vector of input operands with programmable word length at a time. Said unit comprises a
carry look-ahead circuit and a carry propagation circuit, and also two multiplexers, one EXCLUSIVE OR gate,
one EQUIVALENCE gate, one NAND gate and one AND gate with inverted input in each bit. The functionality of
the calculation unit is expanded. The calculation unit comprises a delay element, N/2 AND gates with inverted
input, N/2 decoders of multiplier bits, an N-bit shift register, which each bit consists of an AND gate with inverted
inputs, a multiplexer and a trigger, and a multiplier array, [...] a one-bit partial product generation circuit, a one-bit
adder and a multiplexer. The increase in the adder circuit performance is achieved by the ability to sum two
vectors of input operands of programmable word lengths. The adder circuit comprises a carry look-ahead
circuit, and also two AND gates with inverted input, one half-adder and one EXCLUSIVE OR gate in each bit.
Description

FIELD OF THE INVENTION 

[0001] The group of inventions relates to the field of computer science and can be used for neural network
emulation and real-time digital signal processing.

BACKGROUND OF THE INVENTION 

[0002] A neural processor is known [Principal Directions of Hardware Development of Neural Network Algorithms
Implementation / Y.P. Ivanov and others (Theses of reports of the Second Russian Conference "Neural Computers and
Their Application", Moscow, 14 February, 1996) // Neurocomputer. - 1996. - No. 1,2. - pp. 47-49], comprising an input
data register and four neural units, each of which consists of a shift register, a weight coefficient register, eight
multipliers, a multi-operand summation circuit and a block for the threshold function calculation.

[0003] Such a neural processor executes weighted summation of a fixed amount of input data for a fixed number of
neurons in each clock cycle, irrespective of the real range of input data values and their weight coefficients. In this case
every input datum as well as every weight coefficient is presented in the form of an operand with a fixed word length,
determined by the bit length of the neural processor hardware units.

[0004] The closest one is the neural processor [U.S. Patent No. 5278945, U.S. Cl. 395/27, 1994], comprising three
registers, a multiplexer, a FIFO (First In First Out), a calculation unit to compute the dot product of two vectors of
programmable word length data with the addition of an accumulated result, and a nonlinear unit.

[0005] Input data vectors and their weight coefficients are applied to the inputs of such a neural processor. In each
clock cycle the neural processor performs weighted summation of several input data for one neuron by calculating
the dot product of the input data vector and the weight coefficient vector. In addition, the neural processor supports
processing of vectors whose element word length may be selected from a set of fixed values in program
mode. As the word length of input data and weight coefficients decreases, their number in each vector increases and
thus the neural processor performance improves. However, the word length of the obtained results is fixed and
determined by the bit length of the neural processor hardware units.
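The trade-off between word length and the number of elements per vector can be illustrated with a minimal Python sketch. It assumes that the elements are signed two's complement fields packed side by side into one machine word, with element 0 in the least significant bits; the function names `pack` and `unpack` and the field ordering are hypothetical and are not taken from the cited patent.

```python
def pack(elements, widths):
    """Pack signed integers into one word; element 0 occupies the least significant bits."""
    word, shift = 0, 0
    for value, width in zip(elements, widths):
        lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
        if not lo <= value <= hi:
            raise ValueError(f"{value} does not fit in {width} bits")
        # Masking a negative Python int yields its two's complement field.
        word |= (value & ((1 << width) - 1)) << shift
        shift += width
    return word


def unpack(word, widths):
    """Split a packed word back into signed elements of the given widths."""
    elements, shift = [], 0
    for width in widths:
        field = (word >> shift) & ((1 << width) - 1)
        if field >> (width - 1):      # sign bit set: interpret as negative
            field -= 1 << width
        elements.append(field)
        shift += width
    return elements


def dot(a, b):
    """Weighted summation of input data a with weight coefficients b."""
    return sum(x * y for x, y in zip(a, b))
```

With a 16-bit word, for example, the widths `[8, 8]` carry two elements and the widths `[4, 4, 4, 4]` carry four, so a single word-wide operation covers twice as many terms of the dot product.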

[0006] A digital unit for saturation with a saturation region determined by the absolute value of a number is known [SU
No. 690477, Int. Cl. G 06 F 7/38, 1979], comprising three registers, an adder, two code converters, two sign analyzing
blocks, a correction block, two groups of AND gates and a group of OR gates. Such a unit calculates saturation
functions for a vector of N input operands in 2N clock cycles.

[0007] The closest one is the saturation unit [U.S. Patent No. 5644519, U.S. Cl. 364/736.02, 1997], comprising a
multiplexer, a comparator and two indicators of the saturation. Such a unit calculates saturation functions for a vector
of N input operands in N cycles.

[0008] A calculation unit is known [U.S. Patent No. 5278945, U.S. Cl. 395/27, 1994], comprising multipliers, adders,
registers, a multiplexer and a FIFO. Said unit calculates the dot product of two vectors, which contain M operands
each, in one clock cycle, and multiplies a matrix containing N x M operands by a vector consisting of M operands
in N cycles.

[0009] The closest one is the calculation unit [U.S. Patent No. 4825401, U.S. Cl. 364/760, 1989], comprising 3N/2 AND
gates, N/2 decoders for decoding a multiplier on the basis of Booth's algorithm, a cell array of N columns by N/2 cells
for multiplication, where each cell consists of a circuit to generate one bit of partial product on the basis of Booth's
algorithm and of a one-bit adder, a 2N-bit adder, N/2 multiplexers, N/2 additional circuits to generate one bit of partial
product on the basis of Booth's algorithm and N/2 implicators. Said unit multiplies two N-bit operands, or multiplies
element-by-element two vectors of two (N/2)-bit operands each, in one clock cycle.
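The use of N/2 decoders for an N-bit multiplier is consistent with radix-4 (modified) Booth recoding, in which every pair of multiplier bits is decoded into one signed digit. The following Python sketch shows that recoding under this assumption; it is a behavioural illustration only, and the function names are hypothetical rather than taken from the cited patent.

```python
def booth_radix4_digits(multiplier, n_bits):
    """Recode an n_bits two's complement multiplier into n_bits/2 digits in {-2, -1, 0, 1, 2}.

    Digits are returned least significant first; digit i has weight 4**i.
    """
    if n_bits % 2:
        raise ValueError("n_bits must be even")
    y = multiplier & ((1 << n_bits) - 1)   # two's complement bit pattern
    digits, previous = [], 0               # previous = bit to the right of the current pair
    for i in range(0, n_bits, 2):
        low = (y >> i) & 1
        high = (y >> (i + 1)) & 1
        digits.append(low + previous - 2 * high)
        previous = high
    return digits


def booth_multiply(multiplicand, multiplier, n_bits):
    """Sum the partial products selected by the recoded digits."""
    return sum(digit * multiplicand * 4 ** i
               for i, digit in enumerate(booth_radix4_digits(multiplier, n_bits)))
```

Each digit selects one partial product (0, plus or minus the multiplicand, or plus or minus twice the multiplicand), so an N-bit multiplier needs only N/2 partial products.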

[0010] A unit for summation of vectors with programmable word length operands is known [U.S. Patent No.
5047975, U.S. Cl. 364/786, 1991], comprising adders and AND gates with inverted input.

[0011] The closest one is the adder [U.S. Patent No. 4675837, U.S. Cl. 364/788, 1987], comprising a carry logic and,
in each of its bits, a half-adder and an EXCLUSIVE OR gate. Said adder adds two vectors of N operands each
in N cycles.

DISCLOSURE OF THE INVENTION

[0012] The neural processor comprises first, second, third, fourth, fifth and sixth registers, a shift register, an AND
gate, first and second FIFOs, first and second saturation units, a calculation unit, incorporating inputs of first operand
vector bits, inputs of second operand vector bits, inputs of third operand vector bits, inputs of data boundaries setting
for first operand vectors and result vectors, inputs of data boundaries setting for second operand vectors, inputs of data
boundaries setting for third operand vectors, first and second inputs of load control of third operand vectors into the first
memory block, input of reload control of third operand matrix from the first memory block to the second memory block
and outputs of bits of first and second summand vectors of results of the addition of first operand vector and product of
the multiplication of second operand vector by third operand matrix, stored into the second memory block, an adder
circuit, a switch from 3 to 2 and a multiplexer, and first data inputs of bits of the switch from 3 to 2, data inputs of the first
FIFO, of first, second, third and fourth registers and parallel data inputs of the shift register are bit-by-bit coupled and
connected to respective bits of first input bus of the neural processor, which each bit of second input bus is connected 
to second data input of the respective bit of the switch from 3 to 2, which first output of each bit is connected to input of 
the respective bit of input operand vector of the first saturation unit, which control input of every bit is connected to
output of the corresponding bit of the second register, second output of each bit of the switch from 3 to 2 is connected to
input of the respective bit of input operand vector of the second saturation unit, which control input of each bit is
connected to output of respective bit of the third register, output of each bit of the first register is connected to first data input
of the respective bit of the multiplexer, which second data input of each bit is connected to output of the respective bit
of result vector of the first saturation unit, output of each bit of the multiplexer is connected to input of the respective bit 
of first operand vector of the calculation unit, which input of each bit of second operand vector is connected to output of
the respective bit of result vector of the second saturation unit, data outputs of the first FIFO are connected to inputs of
the respective bits of third operand vector of the calculation unit, which output of each bit of first summand vector of 
results of the addition of first operand vector and product of the multiplication of second operand vector by third operand 
matrix, stored into the second memory block, is connected to input of respective bit of first summand vector of the adder 
circuit, which input of each bit of second summand vector is connected to output of respective bit of second summand
vector of results of the addition of first operand vector and product of the multiplication of second operand vector by third
operand matrix, stored into the second memory block of the calculation unit, which each input of data boundaries set- 
ting for first operand vectors and result vectors is connected to output of the respective bit of the fifth register and to the 
respective input of data boundaries setting for summand vectors and sum vectors of the adder circuit, which output of 
each bit of sum vector is connected to respective data input of the second FIFO, which each data output is connected
to the respective bit of output bus of the neural processor and to third input of the respective bit of the switch from 3 to
2, output of each bit of the fourth register is connected to data input of the respective bit of the fifth register and to the 
respective input of data boundaries setting for third operand vectors of the calculation unit which each input of data 
boundaries setting for second operand vectors is connected to output of the respective bit of the sixth register, which 
data input of each bit is connected to output of the respective bit of the shift register, which sequential data input and
output are coupled and connected to first input of load control of third operand vectors into the first memory block of the
calculation unit and to first input of the AND gate, which output is connected to read control input of the first FIFO, sec- 
ond input of the AND gate, shift control input of the shift register and second input of load control of third operand vec- 
tors into the first memory block of the calculation unit are coupled and connected to respective control input of the 
neural processor, input of reload control of third operand matrix from the first memory block to the second memory block
of the calculation unit and control inputs of fifth and sixth registers are coupled and connected to the respective control
input of the neural processor, control inputs of the switch from 3 to 2, of the multiplexer and of first, second, third and 
fourth register, write control inputs of the shift register and of the first FIFO and read and write control inputs of the sec- 
ond FIFO are respective control inputs of the neural processor, state outputs of first and second FIFOs are state outputs 
of the neural processor. 

[0013] The neural processor may include a calculation unit, comprising a shift register, performing the arithmetic
shift of J bits left on all N-bit vector operands stored in it, where J is the minimal value that is an aliquot part of data word
lengths in second operand vectors of the calculation unit, a delay element, a first memory block, containing a sequential
input port and N/J cells to store N-bit data, a second memory block, containing N/J cells to store N-bit data, N/J
multiplier blocks, each of which multiplies an N-bit vector of programmable word length data by a J-bit multiplier, and a vector
adding circuit, generating the partial product of the summation of N/J + 1 programmable word length data vectors, and inputs of
third operand vector bits of the calculation unit are connected to data inputs of the shift register, which outputs are con- 
nected to data inputs of the first memory block, which outputs of each cell are connected to data inputs of the respective 
cell of the second memory block, which outputs of each cell are connected to inputs of multiplicand vector bits of the 
respective multiplier block, which inputs of the multiplier bits are connected to inputs of the respective J-bit group of
second operand vector bits of the calculation unit, outputs of each multiplier block are connected to inputs of bits of the
respective summand vector of the vector adding circuit, which inputs of (N/J + 1)-th summand vector bits are connected 
to inputs of first operand vector bits of the calculation unit, which inputs of data boundaries setting for third operand vec- 
tors are connected to respective inputs of data boundaries setting for operand vectors of the shift register, which mode 
select input is connected to first input of load control of third operand vectors into the first memory block of the
calculation unit, which second input of load control of third operand vectors into the first memory block is connected to clock
input of the shift register and to input of the delay element, which output is connected to write control input of the first 
memory block, write control input of the second memory block is connected to input of reload control of third operand 
matrix from the first memory block to the second memory block of the calculation unit, which every input of data
boundaries setting for second operand vectors is connected to input of the sign correction of the respective multiplier block,
inputs of data boundaries setting for first operand vectors and for result vectors of the calculation unit are connected to 
inputs of data boundaries setting for multiplicand vectors and for result vectors of each multiplier block and to inputs of 
data boundaries setting for summand vectors and result vectors of the vector adding circuit, which outputs of bits of first 
and second summand vectors of results are respective outputs of the calculation unit. 
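Functionally, the calculation unit described above adds the first operand vector to the product of the second operand vector and the third operand matrix held in the second memory block. A minimal behavioural sketch in Python follows. It works on plain integer lists, ignores the packing of elements into N-bit words and the fact that the hardware delivers the result as two summand vectors, and uses a hypothetical function name.

```python
def multiply_accumulate(first, second, matrix):
    """Return first + second * matrix.

    first  : list of M values (first operand vector)
    second : list of K values (second operand vector)
    matrix : K rows of M values (third operand matrix)
    """
    if len(matrix) != len(second) or any(len(row) != len(first) for row in matrix):
        raise ValueError("operand dimensions do not match")
    result = list(first)
    for multiplier, row in zip(second, matrix):
        for column, weight in enumerate(row):
            # Each row of the matrix is scaled by one element of the second
            # operand vector and accumulated into the running sum.
            result[column] += multiplier * weight
    return result
```

Because the matrix stays in the second memory block between operations, a new matrix can be loaded into the first memory block while the current one is still in use; the sketch does not model that double buffering.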

[0014] In the neural processor described above, each saturation unit may comprise an input data register, which
data inputs are inputs of respective bits of input operand vector of said unit, the calculation unit may comprise an input
data register, which data inputs are inputs of respective bits of first and second operand vectors of the calculation unit,
the adder circuit may comprise an input data register, which data inputs are inputs of respective inputs of the adder cir-

[0015] The saturation unit comprises a carry propagation circuit and a carry look-ahead circuit, and each of N bits
of said unit comprises first and second multiplexers, an EXCLUSIVE OR gate, an EQUIVALENCE gate, a NAND
gate and an AND gate with inverted input, and second data inputs of the first and the second multiplexers and first input
of the EXCLUSIVE OR gate of each bit of said unit are coupled and connected to input of the respective bit of input
operand vector of said unit, which output of each bit of result vector is connected to output of the first multiplexer of the
respective bit of said unit, non-inverted input of the AND gate with inverted input and first inputs of the NAND gate and
the EQUIVALENCE gate of each bit of said unit are coupled and connected to the respective control input of said unit,
first input of the EXCLUSIVE OR gate and non-inverted input of the AND gate with inverted input of q-th bit of said unit
are respectively connected to second input of the EXCLUSIVE OR gate and to inverted input of the AND gate with
inverted input of (q-1)-th bit of said unit, first data input of the second multiplexer of which is connected to output of the
carry to (N-q+2)-th bit of the carry propagation circuit (where q = 2, 3, ..., N), output of the NAND gate of n-th bit of said
unit is connected to input of carry propagation through (N-n+1)-th bit of the carry look-ahead circuit, which output of
the carry to (N-n+2)-th bit is connected to control input of the first multiplexer of n-th bit of said unit, output of the AND
gate with inverted input of which is connected to control input of the second multiplexer of the same bit of said unit, to
carry generation input of (N-n+1)-th bit of the carry look-ahead circuit and to inverted input of the carry propagation
through (N-n+1)-th bit of the carry propagation circuit, which carry input from (N-n+1)-th bit is connected to output of
the second multiplexer of n-th bit of said unit (where n = 1, 2, ..., N), initial carry inputs of the carry propagation circuit and
of the carry look-ahead circuit, second input of the EXCLUSIVE OR gate, inverted input of the AND gate with inverted
input and first data input of the second multiplexer of N-th bit of said unit are coupled and connected to "0", and in each
bit of said unit output of the second multiplexer is connected to second input of the EQUIVALENCE gate, which output
is connected to first data input of the first multiplexer, and output of the EXCLUSIVE OR gate is connected to second
input of the NAND gate of the same bit of said unit.

[0016] In particular cases of the saturation unit usage, when there are strict requirements to minimize hardware
expenses, output of the carry to q-th bit is connected to carry input from (q-1)-th bit in the carry propagation circuit
(where q = 1, 2, ..., N), and the carry look-ahead circuit comprises N AND gates and N OR gates, and
each input of the carry propagation through the respective bit of the carry look-ahead circuit is connected to first input
of the respective AND gate, which output is connected to first input of the respective OR gate, which second input and
output are respectively connected to carry generation input of the respective bit of the carry look-ahead circuit and to
output of the carry to the same bit of the carry look-ahead circuit, second input of the first AND gate is initial carry input
of the carry look-ahead circuit, and second input of q-th AND gate is connected to output of (q-1)-th OR gate (where q =
2, 3, ..., N).
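The chain of N AND gates and N OR gates described above can be written out as a short Python sketch, assuming one propagate bit and one generate bit per position. The function name `carry_chain` and the list-based interface are illustrative choices, not part of the disclosure.

```python
def carry_chain(propagate, generate, initial_carry=0):
    """Evaluate the AND/OR carry chain, one stage per bit.

    propagate, generate : lists of 0/1 values, first bit first
    Returns the list of OR gate outputs, one per stage.
    """
    carries = []
    carry_in = initial_carry                    # second input of the first AND gate
    for p, g in zip(propagate, generate):
        and_out = p & carry_in                  # AND gate of this stage
        or_out = and_out | g                    # OR gate of this stage
        carries.append(or_out)
        carry_in = or_out                       # feeds the AND gate of the next stage
    return carries
```

In this simplest variant the carry ripples from stage to stage, so the delay grows linearly with N; that is the price paid for the minimal gate count.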

[0017] The calculation unit comprises N/2 decoders of multiplier bits, N/2 AND gates with inverted input, a delay
element, an N-bit shift register, which each bit consists of an AND gate with inverted inputs, a multiplexer and a trigger,
and a multiplier array of N columns by N/2 cells, each of them consists of an AND gate with inverted input, a one-bit
partial product generation circuit, a one-bit adder, a multiplexer, first and second triggers, functioning as memory cells
of respectively first and second memory blocks of said unit, and input of each bit of first operand vector of said unit is
connected to second input of the one-bit adder of the first cell of the respective column of the multiplier array, first input
of the one-bit adder of each cell of which is connected to output of the one-bit partial product generation circuit of the
same cell of the multiplier array, control inputs of multiplexers and inverted inputs of the AND gates with inverted input
of all cells of each column of which are coupled and connected to respective input of data boundaries setting for first
operand vectors and for result vectors of said unit, which each input of data boundaries setting for second operand
vectors is connected to inverted input of the respective AND gate with inverted input, which output is connected to first input
of the respective decoder of multiplier bits, respective control inputs of the one-bit partial product generation circuits of
i-th cells of all columns of the multiplier array are coupled and connected to respective outputs of i-th decoder of
multiplier bits, second and third inputs of which are connected to inputs of respectively (2i-1)-th and (2i)-th bits of second
operand vector of said unit (where i = 1, 2, ..., N/2), non-inverted input of j-th AND gate with inverted input is connected
to third input of (j-1)-th decoder of multiplier bits (where j = 2, 3, ..., N/2), input of each bit of third operand vector of said
unit is connected to second data input of the multiplexer of the respective bit of the shift register, which first data input
is connected to output of the AND gate with inverted inputs of the same bit of the shift register, which first inverted input
is connected to respective input of data boundaries setting for third operand vectors of said unit, second inverted input
of the AND gate with inverted inputs of q-th bit of the shift register is connected to first inverted input of the AND gate
with inverted inputs of (q-1)-th bit of the shift register (where q = 2, 3, ..., N), non-inverted input of AND gate with inverted
inputs of r-th bit of the shift register is connected to trigger output of (r-2)-th bit of the shift register (where r = 3, 4, ..., N),
control inputs of multiplexers of all shift register bits are coupled and connected to first input of load control of third
operand vectors into the first memory block of said unit, clock inputs of triggers of all shift register bits and input of the delay
element are coupled and connected to second input of load control of third operand vectors into the first memory block,
output of the multiplexer of each shift register bit is connected to data input of the trigger of the same bit of the shift
register, which output is connected to data input of the first trigger of the last cell of the respective column of the multiplier
array, output of the first trigger of j-th cell of each multiplier array column is connected to data input of the first trigger of
(j-1)-th cell of the same multiplier array column (where j = 2, 3, ..., N/2), clock inputs of the first triggers of all multiplier
array cells are coupled and connected to output of the delay element, clock inputs of the second triggers of all multiplier
array cells are coupled and connected to input of reload control of third operand matrix from the first memory block to
the second memory block, second data input of the one-bit partial product generation circuit of i-th cell of q-th multiplier
array column is connected to output of the AND gate with inverted input of i-th cell of (q-1)-th multiplier array column
(where i = 1, 2, ..., N/2 and q = 2, 3, ..., N), second input of the one-bit adder of j-th cell of each multiplier array column
is connected to sum output of the one-bit adder of the (j-1)-th cell of the same multiplier array column (where j = 2, 3, ...,
N/2), third input of the one-bit adder of j-th cell of q-th multiplier array column is connected to output of the multiplexer
of (j-1)-th cell of (q-1)-th multiplier array column (where j = 2, 3, ..., N/2 and q = 2, 3, ..., N), third input of the one-bit adder
of j-th cell of the first multiplier array column is connected to third output of (j-1)-th decoder of multiplier bits (where j =
2, 3, ..., N/2), sum output of the one-bit adder of the last cell of each multiplier array column is output of the respective
bit of first summand vector of results of said unit, output of the multiplexer of the last cell of (q-1)-th multiplier array
column is output of q-th bit of second summand vector of results of said unit (where q = 2, 3, ..., N), which first bit of second
summand vector of results is connected to third output of (N/2)-th decoder of multiplier bits, second inverted and
non-inverted inputs of the AND gate with inverted inputs of the first bit and non-inverted input of the AND gate with inverted
inputs of the second bit of the shift register, second data inputs of the one-bit partial product generation circuits of all
cells of the first column of the multiplier array, third inputs of one-bit adders of first cells of all multiplier array columns
and non-inverted input of the first AND gate with inverted input are coupled and connected to "0", and in each multiplier
array cell the output of the first trigger is connected to data input of the second trigger, which output is connected to
non-inverted input of the AND gate with inverted input and to first data input of the one-bit partial product generation circuit,
which third control input is connected to second data input of the multiplexer, which first data input is connected to carry
output of the one-bit adder of the same cell of the multiplier array.

[0018] The adder circuit comprises a carry look-ahead circuit, and in each of its N bits a half-adder, an EXCLUSIVE
OR gate, first and second AND gates with inverted input, and input of each bit of first summand vector of the adder
circuit and input of respective bit of second summand vector of the adder circuit are connected respectively to first and
second inputs of the half-adder of respective bit of the adder circuit, inverted inputs of first and second AND gates with
inverted input of each bit of the adder circuit are coupled and connected to respective input of data boundaries setting
for summand vectors and sum vectors, output of the EXCLUSIVE OR gate of each bit of which is output of the
respective bit of sum vector of the adder circuit, output of the first AND gate with inverted input of each bit of the adder circuit
is connected to carry propagation input through the respective bit of the carry look-ahead circuit, which carry generation
input of each bit is connected to output of the second AND gate with inverted input of the respective bit of the adder
circuit, second input of the EXCLUSIVE OR gate of q-th bit of the adder circuit is connected to output of the carry to
q-th bit of the carry look-ahead circuit (where q = 2, 3, ..., N), which initial carry input and second input of the EXCLUSIVE
OR gate of the first bit of the adder circuit are connected to "0", and in each bit of the adder circuit sum output of the
half-adder is connected to first input of the EXCLUSIVE OR gate and to non-inverted input of the first AND gate with
inverted input, and carry output of the half-adder is connected to non-inverted input of the second AND gate with
inverted input of the same bit of the adder circuit.
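The effect of the data boundaries inputs on the adder can be seen in a bit-level Python sketch of the structure described above. It assumes that a boundary bit equal to 1 marks the most significant bit of an element, so that the two AND gates with inverted input suppress both carry propagation and carry generation at that position; this polarity, the ripple evaluation of the carry look-ahead circuit and the name `partitioned_add` are assumptions made for illustration.

```python
def partitioned_add(a, b, boundaries, n_bits):
    """Add two packed vectors bit by bit, blocking carries at element boundaries.

    a, b       : packed summand words (non-negative ints below 2**n_bits)
    boundaries : word whose set bits mark the top bit of each element (assumed polarity)
    """
    carry = 0                                   # initial carry input tied to "0"
    total = 0
    for q in range(n_bits):
        a_q = (a >> q) & 1
        b_q = (b >> q) & 1
        blocked = (boundaries >> q) & 1
        half_sum = a_q ^ b_q                    # sum output of the half-adder
        half_carry = a_q & b_q                  # carry output of the half-adder
        total |= (half_sum ^ carry) << q        # EXCLUSIVE OR gate gives the sum bit
        propagate = half_sum & (1 - blocked)    # first AND gate with inverted input
        generate = half_carry & (1 - blocked)   # second AND gate with inverted input
        carry = generate | (propagate & carry)  # carry into the next bit
    return total
```

With `n_bits=8`, the boundary word `0b10001000` splits the word into two independent 4-bit elements, whereas `0b10000000` treats it as a single 8-bit element; the same hardware therefore serves both word lengths.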

BRIEF DESCRIPTION OF THE DRAWINGS

[0019]

FIG. 1 is a block diagram of the neural processor;
FIG. 2 illustrates the function of the saturation unit;
FIG. 3 is a model of a neural network layer, emulated by the neural processor;
FIG. 4 is a block diagram of the calculation unit;
FIG. 5 is a block diagram of the saturation unit of vectors of programmable word length data;
FIG. 6 is a block diagram of carry look-ahead circuit that can be applied in the saturation unit;
FIG. 7 is a block diagram of the calculation unit;
FIG. 8 illustrates an exemplary implementation of the decoder of multiplier bits and the one-bit partial product
generation circuit on the basis of Booth's algorithm applied in the calculation unit;
FIG. 9 is a block diagram of the adder circuit of vectors of programmable word length data.

[0020] The neural processor, which block diagram is presented in FIG. 1, comprises first 1, second 2, third 3, fourth
4, fifth 5 and sixth 6 registers, a shift register 7, an AND gate 8, first 9 and second 10 FIFOs, a switch from 3 to 2 11, a
multiplexer 12, first 13 and second 14 saturation units, each of which has inputs of bits of input operand vector 15, control
inputs 16 and outputs of bits of result vector 17, a calculation unit 18, comprising inputs of bits of first 19, of second 20
and of third 21 operand vector, inputs of data boundaries setting for first operand vectors and result vectors 22, for
second operand vectors 23 and for third operand vectors 24, first 25 and second 26 inputs of load control of third operand
vectors into the first memory block, input of reload control of third operand matrix from the first memory block to the
second memory block 27 and outputs of bits of first 28 and second 29 summand vectors of results of the addition of first
operand vector and product of the multiplication of second operand vector by third operand matrix, stored into the
second memory block, and an adder circuit 30, comprising inputs of bits of first 31 and second 32 summand vectors, inputs
of data boundaries setting for summand vectors and sum vectors 33 and outputs of bits of sum vector 34. The neural
processor has first 35 and second 36 input buses and output bus 37. Control inputs 38 of the switch from 3 to 2 11,
control input 39 of the multiplexer 12, control input 40 of the first register 1, control input 41 of the second register 2, control
input 42 of the third register 3, control input 43 of the fourth register 4, write control input 44 of the shift register 7, write
control input 45 of the first FIFO 9, write 46 and read 47 control inputs of the second FIFO 10 and the control inputs 26
and 27 of the calculation unit 18 described above are respective control inputs of the neural processor. State outputs 48 of
the first FIFO 9 and state outputs 49 of the second FIFO 10 are state outputs of the neural processor.
[0021] The general view of the saturation function, implemented by the neural processor, is presented in Fig. 2.
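The figure is not reproduced in this text. As a rough illustration only, the following Python sketch assumes the usual form of a saturation function, in which a value is passed through unchanged inside a symmetric signed range and clamped to the nearest limit outside it; the choice of limits as the two's complement range of a given bit count and the function names are assumptions, not a statement of what Fig. 2 shows.

```python
def saturate(value, bits):
    """Clamp a signed integer to the two's complement range of `bits` bits (assumed limits)."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))


def saturate_vector(values, bits_per_element):
    """Apply an individually chosen saturation range to every element of a vector."""
    return [saturate(v, b) for v, b in zip(values, bits_per_element)]
```

Saturating results in this way keeps them within a chosen word length, which is what allows the output of one pass to be fed back as narrower input data for the next.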
[0022] A model of a neural network layer, implemented by the neural processor, is presented in Fig. 3.
[0023] Fig. 4 discloses a block diagram of one of the possible implementations of the calculation unit 18 for execution of
operations over vectors of programmable word length data, comprising a shift register 50, performing the arithmetic shift
of J bits left on all N-bit vector operands stored in it, where J is the minimal value that is an aliquot part of data word lengths
in second operand vectors of the calculation unit 18, a delay element 51, a first memory block 52, containing a sequential
input port and N/J cells to store N-bit data, a second memory block 53, containing N/J cells to store N-bit data, N/J
multiplier blocks 54, each of which multiplies an N-bit vector of programmable word length data by a J-bit multiplier, and a vector
adding circuit 55, generating the partial product of the summation of N/J + 1 programmable word length data vectors.
[0024] The saturation unit, which block diagram is presented in Fig. 5, has inputs of input operand vector 15 bits,
control inputs 16 and outputs of result vector 17 bits. Each of N bits 56 of said unit comprises first 57 and second 58
multiplexers, an EXCLUSIVE OR gate 59, an EQUIVALENCE gate 60, a NAND gate 61 and an AND gate with inverted
input 62. Said unit includes also a carry propagation circuit 63, comprising an initial carry input 64, inverted inputs of the
carry propagation through separate bits 65, carry inputs from separate bits 66 and outputs of the carry to separate bits
67, and a carry look-ahead circuit 68, comprising an initial carry input 69, inputs of the carry propagation through
separate bits 70, carry generation inputs of separate bits 71 and outputs of the carry to separate bits 72.
[0025] Various carry propagation circuits and carry look-ahead circuits applied in parallel adders may be used as
circuits 63 and 68 in the saturation unit.

[0026] In the simplest variant of the carry propagation circuit 63 implementation, output of the carry to q-th bit 67 is
connected to carry input from (q-1)-th bit 66 (where q = 1, 2, ..., N).

[0027] Fig. 6 discloses a simplest carry look-ahead circuit, which comprises N AND gates 73 and N OR gates 74. Each input of the carry propagation through the respective bit 70 of said circuit is connected to the first input of the respective AND gate 73, whose output is connected to the first input of the respective OR gate 74, whose second input and output are respectively connected to the carry generation input of the respective bit 71 and to the output of the carry to the same bit 72 of said circuit. The second input of the first AND gate 73 is the initial carry input 69 of said circuit, and the second input of the q-th AND gate 73 is connected to the output of the (q-1)-th OR gate 74 (where q = 2, 3, ..., N).
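
The gate chain of Fig. 6 can be sketched in software as follows. This is only an illustrative model, assuming one propagation bit and one generation bit per position; the function name `carry_chain` and the list representation of the signals are hypothetical and do not appear in the specification.

```python
def carry_chain(propagate, generate, carry_in):
    """Model the simplest circuit of Fig. 6 (illustrative sketch only).

    For each bit q the OR gate output is g_q OR (p_q AND c_{q-1}),
    where c_0 is the initial carry input.
    """
    carries = []
    carry = carry_in
    for p, g in zip(propagate, generate):
        carry = g | (p & carry)  # AND gate 73 followed by OR gate 74
        carries.append(carry)
    return carries


if __name__ == "__main__":
    # Four bits, initial carry 1: the carry survives while p = 1 or g = 1.
    print(carry_chain([1, 1, 0, 1], [0, 0, 1, 0], 1))
```

Each loop iteration corresponds to one AND/OR gate pair, so the model makes visible that the carry into a bit depends on all lower bits.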
[0028] The calculation unit, whose block diagram is presented in Fig. 7, comprises inputs of first 19, second 20 and third 21 operand vector bits, inputs of boundary setting for first operand vectors and result vectors 22, for second operand vectors 23 and for third operand vectors 24, first 25 and second 26 inputs of load control of third operand vectors into the first memory block, input of reload control of third operand matrix from the first memory block to the second memory block 27 and outputs of bits of first summand vector of results 28 and of second summand vector of results 29. Said unit includes a shift register 50, a delay element 51, N/2 AND gates with inverted input 75, N/2 decoders of multiplier bits 76 and a multiplier array 77 of N columns by N/2 cells in each column. Each bit of the shift register 50 consists of an AND gate with inverted inputs 78, a multiplexer 79 and a trigger 80. Each cell of the multiplier array 77 consists of first 81 and second 82 triggers, functioning as memory cells of the first and second memory blocks of said unit respectively, an AND gate with inverted input 83, a one-bit partial product generation circuit 84, a one-bit adder 85 and a multiplexer 86. In Fig. 7 the columns of cells of the multiplier array 77 are numbered from right to left, and the cells in the columns of the multiplier array 77 from top downward.

[0029] FIG. 8 illustrates an exemplary implementation of the decoder of multiplier bits 76 and the one-bit partial product generation circuit 84 on the basis of Booth's algorithm. The decoder of multiplier bits 76 comprises an EXCLUSIVE OR gate 87, an EQUIVALENCE gate 88 and a NOR gate 89. The one-bit partial product generation circuit 84 comprises AND gates 90 and 91, an OR gate 92 and an EXCLUSIVE OR gate 93.

[0030] The adder circuit, whose block diagram is presented in Fig. 9, has inputs of bits of first summand vector 31 and of second summand vector 32, inputs of data boundaries setting for summand vectors and sum vectors 33 and outputs of bits of sum vector 34. Each of N bits 94 of the adder circuit comprises a half-adder 95, an EXCLUSIVE OR gate 96, and first 97 and second 98 AND gates with inverted input. The adder circuit also includes a carry look-ahead circuit 99.

VARIANTS OF CARRYING OUT THE INVENTION

[0031] The neural processor, whose block diagram is presented in FIG. 1, comprises first 1, second 2, third 3, fourth 4, fifth 5 and sixth 6 registers, a shift register 7, an AND gate 8, first 9 and second 10 FIFOs, a 3-to-2 switch 11, a multiplexer 12, first 13 and second 14 saturation units, each of which has inputs of bits of input operand vector 15, control inputs 16 and outputs of bits of result vector 17, a calculation unit 18, comprising inputs of bits of first 19, second 20 and third 21 operand vectors, inputs of data boundaries setting for first operand vectors and result vectors 22, for second operand vectors 23 and for third operand vectors 24, first 25 and second 26 inputs of load control of third operand vectors into the first memory block, input of reload control of third operand matrix from the first memory block to the second memory block 27 and outputs of bits of first 28 and second 29 summand vectors of results of the addition of first operand vector and product of the multiplication of second operand vector by third operand matrix, stored in the second memory block, and an adder circuit 30, comprising inputs of bits of first 31 and second 32 summand vectors, inputs of data boundaries setting for summand vectors and sum vectors 33 and outputs of bits of sum vector 34. The neural processor has first 35 and second 36 input buses and an output bus 37. Control inputs 38 of the 3-to-2 switch 11, control input 39 of the multiplexer 12, control input 40 of the first register 1, control input 41 of the second register 2, control input 42 of the third register 3, control input 43 of the fourth register 4, write control input 44 of the shift register 7, write control input 45 of the first FIFO 9, write 46 and read 47 control inputs of the second FIFO 10 and the control inputs 26 and 27 of the calculation unit 18 described above are the respective control inputs of the neural processor. State outputs 48 of the first FIFO 9 and state outputs 49 of the second FIFO 10 are state outputs of the neural processor.
[0032] First data inputs of bits of the 3-to-2 switch 11, data inputs of the first FIFO 9, of first 1, second 2, third 3 and fourth 4 registers and parallel data inputs of the shift register 7 are bit-by-bit coupled and connected to first input bus 35 of the neural processor, bits of second input bus 36 of which are connected to second data inputs of the respective bits of the 3-to-2 switch 11. First outputs of bits of the 3-to-2 switch 11 are connected to inputs of the respective bits of input operand vector 15 of the first saturation unit 13, control inputs 16 of bits of which are connected to outputs of the corresponding bits of the second register 2. Second outputs of bits of the 3-to-2 switch 11 are connected to inputs of the respective bits of input operand vector 15 of the second saturation unit 14, control inputs 16 of bits of which are connected to outputs of respective bits of the third register 3. Outputs of bits of the first register 1 are connected to first data inputs of respective bits of the multiplexer 12, second data inputs of bits of which are connected to outputs of respective bits of result vector 17 of the first saturation unit 13. Outputs of bits of the multiplexer 12 are connected to inputs of the respective bits of first operand vector 19 of the calculation unit 18, inputs of bits of second operand vector 20 of which are connected to outputs of the respective bits of result vector 17 of the second saturation unit 14. Data outputs of the first FIFO 9 are connected to inputs of the respective bits of third operand vector 21 of the calculation unit 18, outputs of bits of first summand vector of results 28 of which are connected to inputs of respective bits of first summand vector 31 of the adder circuit 30, inputs of bits of second summand vector 32 of which are connected to outputs of respective bits of second summand vector of results 29 of the calculation unit 18, inputs of data boundaries setting for first operand vectors and result vectors 22 of which are connected to outputs of the respective bits of the fifth register 5 and to the respective inputs of data boundaries setting for summand vectors and sum vectors 33 of the adder circuit 30, outputs of bits of sum vector 34 of which are connected to respective data inputs of the second FIFO 10, whose data outputs are connected to the respective bits of output bus 37 of the neural processor and to third inputs of the respective bits of the 3-to-2 switch 11. Outputs of bits of the fourth register 4 are connected to data inputs of the respective bits of the fifth register 5 and to the respective inputs of data boundaries setting for third operand vectors 24 of the calculation unit 18, inputs of data boundaries setting for second operand vectors 23 of which are connected to outputs of the respective bits of the sixth register 6, whose data inputs are connected to outputs of the respective bits of the shift register 7, whose sequential data input and output are coupled and connected to first input of load control of third operand vectors into the first memory block 25 of the calculation unit 18 and to first input of the AND gate 8, whose output is connected to read control input of the first FIFO 9. Shift control input of the shift register 7 is connected to second input of the AND gate 8 and to second input of load control of third operand vectors into the first memory block 26 of the calculation unit 18, whose input of reload control of third operand matrix from the first memory block to the second memory block 27 is connected to control inputs of fifth 5 and sixth 6 registers.

[0033] The executive units of the neural processor are the first 13 and the second 14 saturation units, the calculation unit 18 and the adder circuit 30. Each of these units executes operations over vectors of programmable word length data in two's complement representation.
[0034] In each clock cycle of the neural processor operation the calculation unit 18 generates a partial product of the multiplication of the vector

$$Y = (Y_1\ Y_2\ \ldots\ Y_K),$$

whose bits are supplied to inputs 20 of the calculation unit 18, by the matrix

$$Z = \begin{pmatrix} Z_{1,1} & Z_{1,2} & \cdots & Z_{1,M} \\ Z_{2,1} & Z_{2,2} & \cdots & Z_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{K,1} & Z_{K,2} & \cdots & Z_{K,M} \end{pmatrix},$$

previously loaded and stored in the second memory block of the calculation unit 18, with addition to the obtained product of the vector

$$X = (X_1\ X_2\ \ldots\ X_M),$$

whose bits are supplied to inputs 19 of the calculation unit 18. On outputs 28 and 29 of the calculation unit 18 bits of the vectors

$$A = (A_1\ A_2\ \ldots\ A_M)$$

and

$$B = (B_1\ B_2\ \ldots\ B_M)$$

are generated, the sum of which is the result of the operation

$$X + Y \times Z.$$

I.e. the sum of the m-th elements of vectors A and B is defined by the following expression:

$$A_m + B_m = X_m + \sum_{k=1}^{K} Y_k \times Z_{k,m} \quad (m = 1, 2, \ldots, M).$$
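
The expression for the m-th element can be checked with a small software model. This is a sketch under stated assumptions: the function names `wrap` and `multiply_accumulate` are hypothetical, vectors are modelled as Python lists of signed integers rather than packed N-bit words, and each result is assumed to wrap around to the word length of its element, as fixed-width two's complement hardware normally does.

```python
def wrap(value, width):
    """Reduce an integer to a signed two's complement value of `width` bits."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >= 1 << (width - 1) else value


def multiply_accumulate(x, y, z, widths):
    """Return S with S_m = X_m + sum over k of Y_k * Z[k][m].

    x has M elements, y has K elements, z is a K-by-M list of lists,
    widths[m] is the assumed word length of the m-th result element.
    """
    return [
        wrap(x[m] + sum(y[k] * z[k][m] for k in range(len(y))), widths[m])
        for m in range(len(x))
    ]


if __name__ == "__main__":
    x = [1, -2]
    y = [2, 3, -1]
    z = [[1, 0], [0, 1], [2, 2]]
    print(multiply_accumulate(x, y, z, [8, 8]))
```

The model returns the sum A + B directly; the split into two summand vectors is a property of the hardware multiplier array and is not reproduced here.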



[0035] Vector X is an N-bit word of M packed data in two's complement representation, which are the elements of this vector. The least significant bits of vector X are the bits of the first datum $X_1$, followed by the bits of the second datum $X_2$, etc. The most significant bits of vector X are the bits of the M-th datum $X_M$. With such packing the v-th bit of the m-th datum $X_m$ is the

$$\left(v + \sum_{\mu=1}^{m-1} N_\mu\right)\text{-th}$$

bit of vector X, where $N_m$ is the word length of the m-th datum $X_m$ of vector X, v=1,2,...,$N_m$, m=1,2,...,M. The number of data M in vector X and the number of bits $N_m$ in the m-th datum $X_m$ of this vector may be any integer value from 1 to N, where m=1,2,...,M. The only restriction is that the total word length of all the data packed in one vector X should be equal to its word length:

$$\sum_{m=1}^{M} N_m = N.$$
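
The packing rule can be illustrated with a short sketch. The function names `pack` and `unpack` are hypothetical, and the N-bit word is modelled as a non-negative Python integer whose bit 0 corresponds to the first (least significant) bit of the vector.

```python
def pack(elements, widths):
    """Pack signed integers into one word; element 1 takes the lowest bits."""
    word, offset = 0, 0
    for value, width in zip(elements, widths):
        if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
            raise ValueError(f"{value} does not fit in {width} bits")
        word |= (value & ((1 << width) - 1)) << offset
        offset += width
    return word


def unpack(word, widths):
    """Recover the signed elements from a packed word."""
    elements, offset = [], 0
    for width in widths:
        raw = (word >> offset) & ((1 << width) - 1)
        if raw >= 1 << (width - 1):  # sign bit set: negative value
            raw -= 1 << width
        elements.append(raw)
        offset += width
    return elements


if __name__ == "__main__":
    widths = [4, 4, 8]  # N = 16, M = 3
    word = pack([3, -1, -8], widths)
    print(hex(word), unpack(word, widths))
```

Here the widths sum to N = 16, which is the restriction stated above.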

[0036] Vector Y is an N-bit word of K packed data in two's complement representation, which are the elements of this vector. The format of vector Y is the same as that of vector X. However, these vectors may differ in the number of elements and in the word length of the separate data packed in them. The minimal word length J of each datum packed in vector Y is defined by the hardware implementation of the multiplication in the calculation unit 18. When the algorithm of partial products is implemented, J is equal to 1; when the modified Booth's algorithm is implemented, J is equal to 2. The number of bits $N'_k$ in the k-th datum $Y_k$ of vector Y may be any integer value from J to N that is a multiple of J, where k=1,2,...,K. The number of data K in vector Y may be any integer value from 1 to N/J. However, the total word length of all the data packed in one vector Y should be equal to its word length:

$$\sum_{k=1}^{K} N'_k = N.$$
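
Why J equals 2 for the modified Booth's algorithm can be seen from the standard radix-4 Booth recoding, which consumes the multiplier two bits at a time. The sketch below shows that textbook recoding, not the gate-level decoder 76 of the specification; the function name `booth_radix4_digits` is hypothetical.

```python
def booth_radix4_digits(multiplier, width):
    """Recode a two's complement multiplier into width/2 digits in -2..2.

    Digit i equals b[2i] + b[2i-1] - 2*b[2i+1], with b[-1] = 0, so that
    the multiplier value equals the sum of digit_i * 4**i.
    """
    if width % 2:
        raise ValueError("width must be a multiple of J = 2")
    bits = multiplier & ((1 << width) - 1)
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, width, 2):
        b0 = (bits >> i) & 1
        b1 = (bits >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


if __name__ == "__main__":
    for value in (-3, 7):
        digits = booth_radix4_digits(value, 4)
        print(value, digits, sum(d * 4 ** i for i, d in enumerate(digits)))
```

Each digit corresponds to one J-bit group of the multiplier, which is why the word length of every element of vector Y must then be a multiple of 2.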



[0037] The k-th row of matrix Z is a data vector

$$Z_k = (Z_{k,1}\ Z_{k,2}\ \ldots\ Z_{k,M}),$$

where k=1,2,...,K. Each of the vectors $Z_1, Z_2, \ldots, Z_K$ should have the same format as that of vector X.

[0038] Vectors A and B, generated at the outputs 28 and 29 of the calculation unit 18, have the same format as that of vector X.

[0039] The calculation unit 18 hardware is tuned to process vectors of the required formats by loading the N-bit control word H to the fifth register 5, whose outputs are connected to inputs 22 of the calculation unit 18, and the (N/J)-bit control word E to the sixth register 6, whose outputs are connected to inputs 23 of the calculation unit 18.

[0040] The value 1 of the n-th bit $h_n$ of the word H means that the calculation unit 18 will regard the n-th bit of each of the vectors $X, Z_1, Z_2, \ldots, Z_K$ as the most significant (sign) bit of the corresponding element of this vector. The number of bits with the value 1 in the word H is equal to the number of elements in each of the vectors $X, Z_1, Z_2, \ldots, Z_K$:

$$\sum_{n=1}^{N} h_n = M.$$



[0041] The value 1 of the i-th bit $e_i$ of the word E means that the calculation unit 18 will regard the i-th J-bit group of bits of vector Y as the group of least significant bits of the corresponding element of this vector. The number of bits with the value 1 in the word E is equal to the number of elements in vector Y:

$$\sum_{i=1}^{N/J} e_i = K.$$
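
The rules for the control words H and E can be illustrated by deriving both words from a list of element word lengths. This is a sketch: the function names are hypothetical, and each word is modelled as a list whose index 0 holds the first bit ($h_1$ or $e_1$).

```python
def control_word_h(widths):
    """Bits h_1..h_N: 1 marks the most significant (sign) bit of each element."""
    bits = []
    for width in widths:
        bits.extend([0] * (width - 1) + [1])
    return bits


def control_word_e(widths, j):
    """Bits e_1..e_(N/J): 1 marks the lowest J-bit group of each element of Y."""
    bits = []
    for width in widths:
        if width % j:
            raise ValueError("each word length must be a multiple of J")
        bits.extend([1] + [0] * (width // j - 1))
    return bits


if __name__ == "__main__":
    h = control_word_h([4, 4, 8])      # format of X with M = 3, N = 16
    e = control_word_e([4, 2, 10], 2)  # format of Y with K = 3, J = 2
    print(h, sum(h))
    print(e, sum(e))
```

The printed sums equal M and K respectively, matching the two summation identities above.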
[0042] Before the calculation unit 18 can operate as described above, the procedure of loading the matrix Z to the second memory block of the calculation unit 18 and the control words H and E to the fifth 5 and the sixth 6 registers respectively should be executed. This procedure is executed in several stages.

[0043] Initially vectors $Z_1, Z_2, \ldots, Z_K$ are sequentially written to the first FIFO 9 from first input bus 35 of the neural processor. The whole matrix Z is loaded to the first FIFO 9 in K clock cycles, in each of which the active signal of the first FIFO 9 write control is applied to input 45 of the neural processor.

[0044] Then the control word H is loaded to the fourth register 4 from first input bus 35 of the neural processor; to do that, an active signal enabling writing to the fourth register 4 is applied to input 43 of the neural processor during one clock cycle. At the next clock cycle the control word E is loaded to the shift register 7 from first input bus 35 of the neural processor; to do that, an active signal enabling writing to the shift register 7 is applied to input 44 of the neural processor during one clock cycle.

[0045] During the next N/J clock cycles the matrix Z is moved from the first FIFO 9 to the first memory block of the calculation unit 18. In each of these N/J clock cycles an active control signal is applied to the neural processor control input connected to the shift control input of the shift register 7, to one of the inputs of the AND gate 8 and to the input 26 of the calculation unit 18. In each clock cycle this signal initiates a shift of the shift register 7 contents by one bit to the right and, hence, the transmission of the next bit of the control word E to its serial output. The signal from the serial output of the shift register 7 is applied to the control input 25 of the calculation unit 18 and to one of the inputs of the AND gate 8. When this signal has the value 1, an active signal is generated at the output of the AND gate 8 and is applied to the read control input of the first FIFO 9. As a result, one of the vectors $Z_1, Z_2, \ldots, Z_K$ is applied to the inputs 21 of the calculation unit 18 from the first FIFO 9, and this vector is written to the first memory block of the calculation unit 18. The number of clock cycles necessary to load one vector $Z_k$ depends on the word length $N'_k$ of the operand $Y_k$ included in vector Y and is equal to $N'_k/J$ (k=1,2,...,K). During the loading of matrix Z to the first memory block of the calculation unit 18 the control word H, stored all this time in the fourth register 4, is applied to inputs 24 of the calculation unit 18 in order to tune its hardware for receiving vectors $Z_1, Z_2, \ldots, Z_K$ of the required format. Since the signal from the serial output of the shift register 7 is also applied to its serial data input and since the word length of the shift register 7 is equal to N/J, when the loading of matrix Z to the first memory block of the calculation unit 18 is complete, the shift register 7 will contain the same data as before this process, i.e. the control word E.
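
The interplay of the shift register 7, the AND gate 8 and the first FIFO 9 during these N/J cycles can be modelled as follows. The sketch rests on assumptions that the specification does not state explicitly: the bit $e_1$ is taken to appear at the serial output in the first cycle, the shift control signal is taken to be active in every cycle, and the function name `load_schedule` is hypothetical.

```python
def load_schedule(e_bits):
    """Simulate N/J shift cycles of the control word E.

    Returns (rows, register): rows[c] is the 1-based index k of the vector
    Z_k presented to the calculation unit in cycle c, and register is the
    shift register contents after the last cycle.
    """
    register = list(e_bits)  # register[0] models the serial output bit
    rows = []
    current_row = 0
    for _ in range(len(e_bits)):
        serial_out = register[0]
        if serial_out:  # AND gate output active: read the next vector from the FIFO
            current_row += 1
        rows.append(current_row)
        register = register[1:] + [serial_out]  # serial output fed back to serial input
    return rows, register


if __name__ == "__main__":
    e = [1, 0, 1, 1, 0, 0, 0, 0]  # Y elements of 4, 2 and 10 bits, J = 2
    rows, register = load_schedule(e)
    print(rows)
    print(register == e)
```

In this example the vectors $Z_1$, $Z_2$ and $Z_3$ are presented for 2, 1 and 5 cycles, that is $N'_k/J$ cycles each, and the register ends up holding E again.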

[0046] After that an active signal is applied to the neural processor control input connected to the control input 27 of the calculation unit 18 and to the control inputs of the fifth 5 and the sixth 6 registers. As a result, in one clock cycle the matrix Z is loaded from the first memory block to the second memory block of the calculation unit 18, the control word H is rewritten from the fourth register 4 to the fifth register 5, and the control word E is rewritten from the shift register 7 to the sixth register 6.

[0047] Starting from the next clock cycle the calculation unit 18 will perform the operation described above in every clock cycle:

$$A + B = X + Y \times Z.$$



[0048] The adder circuit 30 executes the addition of vectors A and B, applied to its inputs 31 and 32 from outputs 28 and 29 of the calculation unit 18, in each clock cycle. At the outputs 34 of the adder circuit 30 the vector

$$S = (S_1\ S_2\ \ldots\ S_M)$$

is generated, the m-th element of which is equal to the sum of the m-th elements of vectors A and B:

$$S_m = A_m + B_m \quad (m = 1, 2, \ldots, M).$$

[0049] Vector S will have the same format as vectors A and B. The adder circuit 30 hardware is tuned to process vectors of the required formats by supplying the control word H, stored in the fifth register 5, to the inputs 33 of the adder circuit 30.

[0050] Thus, the sequential connection of the calculation unit 18 and the adder circuit 30 makes it possible to execute the operation

$$S = X + Y \times Z$$






over vectors of programmable word length data in each clock cycle. The results of this operation over different sets of input operand vectors are written to the second FIFO 10, functioning as an intermediate result accumulator; to do that, the signal enabling writing to the second FIFO 10 is applied to input 46 of the neural processor.
[0051] The calculation unit 18 and the adder circuit 30 can be used as a one-cycle switch of K data, packed in one N-bit vector Y, applied to inputs 20 of the calculation unit 18, to M data, packed in one N-bit vector S, generated at the outputs 34 of the adder circuit 30. Such switching is performed by executing the operation

$$S = X + Y \times Z,$$

where vector X, applied to inputs 19 of the calculation unit 18, has zero values in all its bits, and the matrix Z stored in the second memory block of the calculation unit 18 defines the switching rules. Matrix Z should satisfy the following requirements: the element $Z_{k,m}$, located at the intersection of the k-th row and the m-th column of matrix Z, should have the value 1, i.e. (00...01)b, if it is required that the m-th element $S_m$ of vector S is equal to the k-th element $Y_k$ of vector Y, or the value 0, i.e. (00...00)b, otherwise; vector $Z_k$, which is the k-th row of matrix Z, should be of the same format as vector S; and each column of matrix Z should contain not more than one element having the value 1 (k=1,2,...,K; m=1,2,...,M). The procedure described above of loading the control word H, defining the required format of vector S, to the fifth register 5, the control word E, defining the required format of vector Y, to the sixth register 6 and matrix Z, defining the commutation rules, to the second memory block of the calculation unit 18 should precede the switching operation.
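
The requirements on matrix Z for switching can be illustrated with a small model. It is a sketch only: the function names `switching_matrix` and `switch` are hypothetical, indices are 0-based, and vectors are modelled as lists of integers rather than packed words.

```python
def switching_matrix(routes, k_count):
    """Build a K-by-M matrix of 0/1 elements.

    routes[m] is the 0-based index k of the element of Y that should
    appear as S_m, or None if S_m should be zero.
    """
    return [
        [1 if routes[m] == k else 0 for m in range(len(routes))]
        for k in range(k_count)
    ]


def switch(y, z):
    """Compute S = 0 + Y x Z, i.e. the operation with a zero vector X."""
    return [
        sum(y[k] * z[k][m] for k in range(len(y)))
        for m in range(len(z[0]))
    ]


if __name__ == "__main__":
    y = [5, -3, 7]               # K = 3
    routes = [2, 0, None, 0]     # M = 4
    z = switching_matrix(routes, len(y))
    print(z)
    print(switch(y, z))
```

Because every column holds at most one 1, each element of S is either a copy of one element of Y or zero, while one element of Y may be copied to several positions.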
[0052] The operation

$$S = X + Y \times Z$$

is executed in one clock cycle, while the loading of matrix Z to the first memory block of the calculation unit 18 occupies not less than N/J clock cycles. So effective usage of the neural processor computing resources is achieved only when packages of data vectors are processed; to support this, the second memory block is incorporated into the calculation unit 18, and a two-port FIFO rather than a register is used as the intermediate result accumulator 10.
[0053] In package processing the set of input operand vectors applied sequentially to each of the inputs 19 and 20 of the calculation unit 18 is split into successively processed subsets (packages). The set of input operand vectors applied sequentially to each of the inputs 19 and 20 of the calculation unit 18 and included in the τ-th package can be presented in the form of a vector of data vectors:

$$X^{\tau} = \begin{pmatrix} X^{\tau,1} \\ X^{\tau,2} \\ \vdots \\ X^{\tau,T_\tau} \end{pmatrix}, \qquad Y^{\tau} = \begin{pmatrix} Y^{\tau,1} \\ Y^{\tau,2} \\ \vdots \\ Y^{\tau,T_\tau} \end{pmatrix},$$

where $T_\tau$ is the number of vectors included in the τ-th package. All vectors in one package should have the same format, i.e. the contents of the fifth 5 and the sixth 6 registers should remain unchanged during the processing of one package.

[0054] Processing of the τ-th packages $X^{\tau}$ and $Y^{\tau}$ is executed in $T_\tau$ clock cycles. At the t-th clock cycle the calculation unit 18 and the adder circuit 30 execute the operation

$$S^{\tau,t} = X^{\tau,t} + Y^{\tau,t} \times Z^{\tau} \quad (t = 1, 2, \ldots, T_\tau),$$

where $Z^{\tau}$ is the contents of the second memory block of the calculation unit 18, which should remain unchanged during the processing of the τ-th packages $X^{\tau}$ and $Y^{\tau}$. The whole process of processing the τ-th packages $X^{\tau}$ and $Y^{\tau}$ can be considered as the procedure of multiplication of the data matrix $Y^{\tau}$ by the data matrix $Z^{\tau}$ with accumulation of the results.
[0055] Simultaneously with the processing of the τ-th vector package, the loading procedure described above is executed: the control word $H^{\tau+1}$, defining the format of vectors of the (τ+1)-th package $X^{\tau+1}$, is loaded to the fourth register 4, the control word $E^{\tau+1}$, defining the format of vectors of the (τ+1)-th package $Y^{\tau+1}$, is loaded to the shift register 7, and the matrix $Z^{\tau+1}$ is moved from the first FIFO 9 to the first memory block of the calculation unit 18. New values need to be loaded to the fourth register 4 only if vectors of the (τ+1)-th package $X^{\tau+1}$ differ in format from vectors of the τ-th package $X^{\tau}$, and new values need to be loaded to the shift register 7 only if vectors of the (τ+1)-th package $Y^{\tau+1}$ differ in format from vectors of the τ-th package $Y^{\tau}$. This procedure occupies not more than N/J+2 clock cycles.

[0056] When both of the mentioned processes are complete, an active signal, initiating the simultaneous move of the word $H^{\tau+1}$ from the fourth register 4 to the fifth register 5, of the word $E^{\tau+1}$ from the shift register 7 to the sixth register 6 and of matrix $Z^{\tau+1}$ from the first to the second memory block of the calculation unit 18, is applied to the neural processor control input 27. All these moves are executed in one clock cycle.

[0057] The number of vectors $T_\tau$ in every τ-th package may be set programmatically, but it should not exceed the value $T_{max}$, which is equal to the number of cells in the second FIFO 10. On the other hand, it is not expedient to use packages of vectors with $T_\tau$ less than N/J+2, because in this case the neural processor computing facilities are not used efficiently.

[0058] Simultaneously with the move of the matrix $Z^{\tau+1}$ from the first FIFO 9 to the first memory block of the calculation unit 18, the successive loading of the third operand vectors that compose the matrices $Z^{\tau+2}$, $Z^{\tau+3}$, etc. from the neural processor first input bus 35 to the first FIFO 9 may be executed.

[0059] All the simultaneous processes are synchronized by means of analyzing the signals of state of the first 9 and 
the second 10 FIFOs, applied to the outputs 48 and 49 of the neural processor, and by means of control signals applied 
to the corresponding inputs of the neural processor. 

[0060] The 3-to-2 switch 11 and the multiplexer 12 form the commutation system, due to which the contents of the second FIFO 10 or data supplied from one of the neural processor input buses 35 or 36 can be applied both to inputs of the first operand vector 19 and to inputs of the second operand vector 20 of the calculation unit 18. Besides, the contents of the first register 1, previously written to it from the neural processor first input bus 35 by applying an active signal to the neural processor control input 40, can be applied to inputs 19 of the calculation unit 18. Selection of the sources of data applied to inputs 19 and 20 of the calculation unit 18 is made by setting a certain combination of signals on the neural processor control inputs 38 and 39. If the data source is the second FIFO 10, then the signal enabling reading from the second FIFO 10 should be applied to the neural processor control input 47.

[0061] Data vectors applied to inputs 19 and 20 of the calculation unit 18 from the second FIFO 10 or from one of the neural processor input buses 35 or 36 pass through saturation units 13 and 14. Each of the units 13 and 14 calculates in one clock cycle the saturation function of each element of vector

$$D = (D_1\ D_2\ \ldots\ D_L),$$

applied to inputs 15 of this unit.

[0062] Vector D is an N-bit word of L packed data in two's complement representation, which are the elements of this vector. The format of vector D is the same as that of vector X described above. However, these vectors can differ in the number of elements and in the word length of the separate data packed in them. The minimal word length of the data composing vector D is equal to two. The number of data L in vector D may be any integer value from 1 to N/2. However, the total word length of all the data packed in one vector D should be equal to its word length:

$$\sum_{\lambda=1}^{L} N_\lambda = N,$$

where $N_\lambda$ is the word length of the λ-th datum $D_\lambda$ of vector D.

[0063] At outputs 17 of the saturation units 13 or 14 a vector

$$F = (F_1\ F_2\ \ldots\ F_L)$$

is generated, which has the same format as that of vector D. The λ-th element $F_\lambda$ of vector F is the result of calculation of the saturation function over the λ-th operand $D_\lambda$ of vector D:

$$F_\lambda = \Psi_{Q_\lambda}(D_\lambda),$$

where $Q_\lambda$ is a parameter of the saturation function calculated for the operand $D_\lambda$ (λ=1,2,...,L). The general view of the saturation function calculated by the units 13 and 14 is presented in Fig. 2 and may be described by the following expressions:

$$\Psi_Q(D) = D, \quad \text{if } -2^Q \le D \le 2^Q - 1;$$

$$\Psi_Q(D) = 2^Q - 1, \quad \text{if } D > 2^Q - 1;$$

$$\Psi_Q(D) = -2^Q, \quad \text{if } D < -2^Q.$$
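
The three expressions amount to clamping D to the range of a (Q+1)-bit two's complement number. A minimal sketch follows; the function name `saturate` is hypothetical.

```python
def saturate(d, q):
    """Clamp d to the closed interval from -2**q to 2**q - 1."""
    low, high = -(1 << q), (1 << q) - 1
    return max(low, min(high, d))


if __name__ == "__main__":
    for d in (100, 7, -16, -100):
        print(d, saturate(d, 4))
```

With Q = 4 the values 100 and -100 are replaced by the bounds 15 and -16, while values already inside the interval pass unchanged.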

[0064] The number of significant bits in the element $F_\lambda$ of vector F, without taking the sign bit into account, is equal to the value of the parameter $Q_\lambda$ (λ=1,2,...,L). It is obvious that the value $Q_\lambda$ should be less than the word length of operands $D_\lambda$ and $F_\lambda$.

[0065] The hardware of each of the saturation units 13 or 14 is tuned for the required format of vectors D and F, and also for the required values of parameters of the implemented saturation functions, by setting an N-bit control word U on control inputs 16 of said unit.

[0066] The bits of word U should have the following values: bits from the first to the $Q_1$-th have the value 0 each, bits from the $(Q_1+1)$-th to the $N_1$-th have the value 1 each, bits from the $(N_1+1)$-th to the $(N_1+Q_2)$-th have the value 0 each, bits from the $(N_1+Q_2+1)$-th to the $(N_1+N_2)$-th have the value 1 each, etc., where $N_\lambda$ is the word length of the λ-th element of vector D. In the general case the bits of word U from the

$$\left(1 + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

to the

$$\left(Q_\lambda + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

bit should have the value 0 each, and from the

$$\left(1 + Q_\lambda + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

to the

$$\left(\sum_{\mu=1}^{\lambda} N_\mu\right)\text{-th}$$

bit the value 1 each (λ=1,2,...,L).
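
The layout of word U can be illustrated by generating it from the element word lengths and the saturation parameters. This is a sketch: the function name `control_word_u` is hypothetical, and the word is modelled as a list whose index 0 holds the first (least significant) bit.

```python
def control_word_u(widths, q_values):
    """Build U: for each element, q zero bits followed by (width - q) one bits."""
    bits = []
    for width, q in zip(widths, q_values):
        if not 0 <= q < width:
            raise ValueError("Q must be less than the element word length")
        bits.extend([0] * q + [1] * (width - q))
    return bits


if __name__ == "__main__":
    u = control_word_u([8, 8], [3, 7])  # N = 16, L = 2
    print(u)
    print(u.count(0))  # equals Q_1 + Q_2
```

For a single N-bit element with Q = N - 1 the function yields a word with only the last bit set, which corresponds to U=(100...0)b when the most significant bit is written first.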

[0067] If the value of the n-th bit of the word U is equal to 1 ($u_n=1$) and the value of the (n+1)-th bit is equal to 0 ($u_{n+1}=0$), then the saturation unit 13 or 14 will regard the n-th bit of vector D as the most significant (sign) bit of the corresponding element of this vector. The number of zero bits in the word U is equal to the total number of significant bits in all the elements of the vector of results F:

$$\sum_{n=1}^{N} \bar{u}_n = \sum_{\lambda=1}^{L} Q_\lambda.$$

[0068] If U=(100...0)b, then data from inputs 15 of the saturation unit 13 or 14 will pass to its outputs 17 without changes (F=D).

[0069] The control word of the first saturation unit 13 is loaded from the neural processor first input bus 35 to the second register 2, whose outputs are connected to control inputs 16 of the saturation unit 13. This load is executed in one clock cycle by means of an active signal applied to the control input 41 of the second register 2.

[0070] The control word of the second saturation unit 14 is loaded from the neural processor first input bus 35 to the third register 3, whose outputs are connected to control inputs 16 of the saturation unit 14. This load is executed in one clock cycle by means of an active signal applied to the control input 42 of the third register 3.

[0071] The saturation units 13 and 14 are an effective means of preventing arithmetic overflow when the input operand [...]

[0072] Each of the saturation units 13 or 14 allows to reduce only the number of significant bits in elements of the processed data vector. The word length of separate elements of the data vector and its format remain unchanged. At the same time, in some cases it is expedient to calculate saturation functions for elements of a data vector while reducing the word length of every element of the result vector by discarding all its most significant bits that are the extension of the sign bit of this element. Such a word length reduction of elements of the vector

$$F = (F_1\ F_2\ \ldots\ F_L),$$

generated at the outputs 17 of the saturation unit 14, and the repackaging of elements in vectors due to this reduction can be executed in one clock cycle by means of the calculation unit 18 and the adder circuit 30, which operate as a data switch from 2L directions to L + 1. As an example, the transformation of vector F to vector

$$S = (S_1\ S_2\ \ldots\ S_{L+1}),$$

generated at the outputs 34 of the adder circuit 30, is described below, where the λ-th element $S_\lambda$ is the $Q_\lambda + 1$ low-order (significant) bits of the λ-th element $F_\lambda$ of vector F (λ=1,2,...,L), and the (L + 1)-th element $S_{L+1}$, located in the most significant bits, is equal to (00...0)b. Vector F, generated at outputs 17 of the unit 14, may be presented in the form of vector

$$Y = (Y_1\ Y_2\ \ldots\ Y_{2L}),$$

applied to inputs 20 of the calculation unit 18, where the first $Y_{2\lambda-1}$ and the second $Y_{2\lambda}$ elements of the λ-th pair of elements are respectively the $Q_\lambda + 1$ least significant and the $N_\lambda - Q_\lambda - 1$ most significant bits of the λ-th $N_\lambda$-bit element $F_\lambda$ of vector F (λ=1,2,...,L). In this case zero values are applied to inputs 19 of the calculation unit 18, and due to this fact the result of multiplication of vector Y by matrix Z, stored in the second memory block of the calculation unit 18, is generated at outputs 34 of the adder circuit 30. This result will be the vector S of the required format if the control word H, defining the vector S format described above, is stored in the fifth register 5, the control word E, defining the vector Y format described above, in the sixth register 6, and matrix Z, containing L+1 elements in each of its 2L rows, in the second memory block of the calculation unit 18. Matrix Z should satisfy the following requirements: the word length of each element of the λ-th column of matrix Z should be equal to $Q_\lambda + 1$; the element $Z_{2\lambda-1,\lambda}$, located at the intersection of the (2λ-1)-th row and the λ-th column of matrix Z, should have the value 1, i.e. (00...01)b, and the rest of the elements of matrix Z should have the value 0, i.e. (00...00)b (λ=1,2,...,L).

[0073] If, at execution of the operation of transforming vector F described above, generated at outputs 17 of the saturation unit 14, vector

$$X = (X_1\ X_2\ \ldots\ X_{M+1})$$

is applied to inputs 19 of the calculation unit 18, which the first element is equal to zero and has the word length equal to

$$\sum_{\lambda=1}^{L} (Q_\lambda + 1),$$

then vector

$$S = (S_1\ S_2\ \ldots\ S_{L+M})$$

will be generated at the outputs 34 of the adder circuit 30, where the λ-th element $S_\lambda$ consists of the $Q_\lambda + 1$ low-order (least significant) bits of the λ-th element $F_\lambda$ of vector F (λ = 1, 2, ..., L), and the (L+m)-th element is equal to the (m+1)-th element $X_{m+1}$ of vector X (m = 1, 2, ..., M). Thus, the neural processor allows to execute the saturation over elements of an input data vector and to pack the obtained result to another input data vector per one clock cycle.
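The compression by multiplication with a selection matrix can be illustrated with a minimal Python sketch. It works on element values rather than on packed N-bit words, and the function names `sign_extend` and `compress` are hypothetical, not taken from the description. It assumes every element has at least one sign-extension bit to drop ($Q_\lambda + 1 < N_\lambda$).

```python
def sign_extend(raw, width):
    """Interpret the low `width` bits of raw as a two's complement number."""
    if (raw >> (width - 1)) & 1:
        return raw - (1 << width)
    return raw


def compress(f_values, formats):
    """Drop sign-extension bits from saturated elements via S = Y x Z.

    f_values -- saturated elements F_lambda as signed integers
    formats  -- (N_lambda, Q_lambda) per element, with Q_lambda + 1 < N_lambda
    """
    y = []
    for f, (n_len, q) in zip(f_values, formats):
        raw = f & ((1 << n_len) - 1)
        y.append(sign_extend(raw & ((1 << (q + 1)) - 1), q + 1))  # Y_(2*lambda-1)
        y.append(sign_extend(raw >> (q + 1), n_len - q - 1))      # Y_(2*lambda)
    count = len(f_values)
    # Selection matrix Z: 2L rows, L+1 columns, ones at (2*lambda-1, lambda).
    z = [[0] * (count + 1) for _ in range(2 * count)]
    for lam in range(count):
        z[2 * lam][lam] = 1
    return [sum(y[r] * z[r][col] for r in range(2 * count))
            for col in range(count + 1)]


# An 8-bit element saturated to Q=3 and a 4-bit element saturated to Q=2.
packed = compress([7, -4], [(8, 3), (4, 2)])
print(packed)  # [7, -4, 0]
```

Each result element keeps its value but needs only $Q_\lambda + 1$ bits, and the last element is the zero filler in the most significant position.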

[0074] The main function of the neural processor is emulation of various neural networks. In the general case one neural network layer consists of Ω neurons and has Θ neural inputs. And the ω-th neuron executes weighted summation of Θ data $C_1, C_2, \ldots, C_\Theta$, applied to the respective neural inputs, taking into account the neuron bias $V_\omega$:

$$G_\omega = V_\omega + \sum_{\theta=1}^{\Theta} C_\theta \times W_{\theta,\omega},$$

where $W_{\theta,\omega}$ - a weight coefficient of the θ-th input in the ω-th neuron (θ = 1, 2, ..., Θ; ω = 1, 2, ..., Ω). Then the ω-th neuron calculates the saturation function over the result of weighted summation $G_\omega$:

$$R_\omega = \Psi_{Q_\omega}(G_\omega).$$

[0075] The general view of the saturation function, implemented by the neural processor, is presented in Fig.2. All input data, weight coefficients, bias values and results are presented in two's complement.
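The layer computation described above (weighted summation with a bias, followed by saturation) can be sketched in a few lines of Python. This is only an illustration on plain integers, not a model of the hardware; the names `saturate` and `emulate_layer` are hypothetical, and the saturation range $[-2^Q, 2^Q - 1]$ follows the definition of the saturation function given later in the description.

```python
def saturate(value, q):
    """Clamp value to the signed range [-2**q, 2**q - 1]."""
    low, high = -(1 << q), (1 << q) - 1
    return max(low, min(high, value))


def emulate_layer(inputs, weights, biases, q_params):
    """Return the saturated outputs of one neural network layer.

    inputs   -- list of Theta input values C_theta
    weights  -- weights[theta][omega]: coefficient of input theta in neuron omega
    biases   -- list of Omega bias values V_omega
    q_params -- list of Omega saturation parameters Q_omega
    """
    outputs = []
    for omega, bias in enumerate(biases):
        # Weighted summation G_omega = V_omega + sum(C_theta * W_theta_omega).
        g = bias + sum(c * weights[theta][omega]
                       for theta, c in enumerate(inputs))
        outputs.append(saturate(g, q_params[omega]))
    return outputs


layer_out = emulate_layer(
    inputs=[3, -2],
    weights=[[4, 1], [5, -6]],
    biases=[1, 0],
    q_params=[3, 3],
)
print(layer_out)  # [3, 7]
```

The first neuron sums to 3, which lies inside the range for Q = 3; the second sums to 15 and is clamped to 7.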
[0076] The peculiarity of the offered neural processor is that the user can set the following neural network parameters in program mode: the number of layers, the number of neurons and neural inputs in each layer, the word length of data at each neural input, the word length of each weight coefficient, the word length of the output value of each neuron and the saturation function parameter for each neuron.

[0077] One neural processor can emulate a neural network of a practically unlimited size. A neural network is emulated per layer (sequentially layer-by-layer).

[0078] Each neural network layer is divided into sequentially processed fragments. This division is made in the following way. The set of neural inputs of a layer is divided into groups so that the total word length of data applied to all inputs of each group of inputs is equal to the neural processor word length N. The set of neurons of a layer is divided into groups of neurons so that the total word length of the results of weighted summation of all input data for each neuron group is equal to the neural processor word length N. And the whole neural network layer is divided into fragments of two types having different functions. Each fragment of the first type executes weighted summation of data applied to all neural inputs, included into one group of inputs, for all neurons from one neuron group. Each fragment of the second type generates output values for all neurons from one neuron group by means of calculation of the saturation function over the results of weighted summation of all input data.

[0079] Fig.3 can be used as an illustration of the principle of the neural network layer division into fragments described above. For this it is necessary to consider that each block, presented in Fig.3, executes operations over N-bit data vectors and to treat the designations in this figure in the following way:

$C_\theta$ - a vector of data, applied to the θ-th group of neural inputs (θ = 1, 2, ..., Θ);

$V_\omega$ - a vector of bias values of the ω-th neuron group (ω = 1, 2, ..., Ω);

$W_{\theta,\omega}$ - a matrix of weight coefficients of input data, applied to the θ-th group of neural inputs, in the ω-th neuron group (θ = 1, 2, ..., Θ; ω = 1, 2, ..., Ω);

$G_\omega$ - a result vector of input data weighted summation in the ω-th neuron group (ω = 1, 2, ..., Ω);

$R_\omega$ - a vector of output values of the ω-th neuron group (ω = 1, 2, ..., Ω).

In Fig.3 a pair of devices, executing multiplication and addition, corresponds to each fragment of the first type, and one saturation unit corresponds to each fragment of the second type.

[...] The whole neural network emulation process on one neural processor can be presented in the form of Ω [...] during the neural network layer emulation. [...]

[...] At each clock cycle of the executive phase of the first macro-operation of the ω-th neuron group [...]

$$G_{1,\omega}^{t} = V_\omega + C_1^{t} \times W_{1,\omega},$$

[...] neuron emulation procedure. A control word is loaded from the neural processor first input bus [...] defines the format of data vectors applied to the θ-th group of neural inputs. Matrix [...] FIFO 9, where this matrix should be previously loaded from the neural processor first input [...]

[...] the second operands vector 20 of the calculation unit 18 from the neural processor second input bus 36 (t = 1, 2, ..., T). And the calculation unit 18 and the adder circuit 30 form a partial sum vector

$$G_{\theta,\omega}^{t} = G_{\theta-1,\omega}^{t} + C_\theta^{t} \times W_{\theta,\omega},$$

[...] macro-operations of every emulation procedure of a neuron group the saturation [...] may be used for restriction of partial sums values in order to exclude the possibility of arithmetic overflow during the input data weighted summation. In this case the preparatory phase of macro-operations should include the load of a control word to the second register 2 from the neural processor first input bus 35.

[0087] The following operations are successively executed during the preparative phase of the (Θ+1)-th macro-operation (θ = 2, 3, ..., Θ) of the ω-th neuron group emulation procedure. A control word is loaded from the neural processor first input bus 35 to the third register 3, which defines the parameters of the saturation functions calculated for the ω-th neuron group. Then control data, which is necessary for execution of compressing and packing of the results of the saturation function calculations, is loaded to the fourth register 4, to the shift register 7 and to the first memory block of the calculation unit 18.

[0088] At every t-th clock cycle of the executive phase of the (Θ+1)-th macro-operation of the ω-th neuron group emulation procedure the partial sums vector $G_{\Theta,\omega}^{t}$ is applied to the inputs 15 of the saturation unit 14 from the second FIFO 10, and as a result a vector of saturation function values is generated at the outputs 17 of the saturation unit 14, which is then applied to the inputs 20 of the calculation unit 18. The calculation unit 18 and the adder circuit 30 compress this vector by means of removal of all the bits which are the extension of the sign bit from all its elements. If in this case not a zero vector is applied to the inputs 19 of the calculation unit 18, but a data vector from one of the neural processor input buses 35 or 36, then the result of vector compression will be packed to that input data vector. The result, obtained at the t-th clock cycle of the executive phase of the (Θ+1)-th macro-operation of the (ω-1)-th neuron group emulation procedure and stored in the external memory, may be used as such vector. The result is recorded to the second FIFO 10.

[0089] When any macro-operation of a neural network fragment emulation is being executed, the change-over from the preparative phase to the executive one takes place by supplying an active signal to the neural processor control input 27 for one clock cycle, preceding the first clock cycle of the executive phase. And the contents of the fourth register 4 is rewritten to the fifth register 5, the contents of the shift register 7 is rewritten to the sixth register 6 and the contents of the first memory block of the calculation unit 18 is moved to its second memory block.

[0090] The successive execution of macro-operations is performed by the neural processor in the pipeline mode, in which the executive phase of the current macro-operation is made simultaneously with the preparative phase of the next macro-operation. The number of clock cycles necessary for execution of all operations of the preparatory phase of a macro-operation is within the range from N/J to N/J+4, depending on the number of control words loaded to the neural processor registers. The number of clock cycles necessary for the executive phase of any macro-operation is equal to the number of processed input data sets T, assigned by the user. Thus, the minimal period of a macro-operation execution is determined by the preparative phase duration and is equal to the duration of N/J processor clock cycles. It is expedient to select the T value equal to N/J, because with smaller values of T the neural processor units will not be used efficiently, and with bigger values of T the time of the neural processor reaction to the next data set at the neural inputs increases, which is undesirable for real-time neural network emulation.

[0091] In the general case the process of emulation of a neural network layer, split into Ω × (Θ + 1) fragments, for T input data sets is executed on one neural processor per Ω × (Θ + 1) × T clock cycles, but not less than per Ω × (Θ + 1) × N/J clock cycles.
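The cycle count stated above can be checked with a small Python helper. This is a sketch under the assumption that every macro-operation takes the larger of its executive phase (T cycles) and the minimal preparative phase (N/J cycles); the function name and the sample values N = 64 and J = 2 are illustrative only.

```python
def layer_clock_cycles(omega_groups, theta_groups, t_sets, n_bits, j_bits):
    """Estimate clock cycles to emulate one layer on one neural processor.

    omega_groups -- number of neuron groups (Omega)
    theta_groups -- number of neural input groups (Theta)
    t_sets       -- number of input data sets T per macro-operation
    n_bits       -- processor word length N
    j_bits       -- minimal multiplier group width J
    """
    prep_cycles = n_bits // j_bits           # minimal preparative phase, N/J
    per_macro = max(t_sets, prep_cycles)     # pipelined phases overlap
    return omega_groups * (theta_groups + 1) * per_macro


balanced = layer_clock_cycles(2, 3, t_sets=32, n_bits=64, j_bits=2)
short_t = layer_clock_cycles(2, 3, t_sets=8, n_bits=64, j_bits=2)
long_t = layer_clock_cycles(2, 3, t_sets=100, n_bits=64, j_bits=2)
print(balanced, short_t, long_t)  # 256 256 800
```

With T below N/J the total does not shrink, which is the inefficiency the text refers to; with T above N/J the total grows in proportion to T.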

[0092] A small neural network layer, where the total word length of data applied to all neural inputs and the total word length of the results of weighted summation for all neurons do not exceed the neural processor bit length N each, is emulated by execution of two macro-operations. The first macro-operation emulates the weighted summation of all input data for all neurons of the layer and the second one - calculation of saturation functions for all neurons of the layer.

[0093] The presence of two input buses 35 and 36 and one output bus 37 in the neural processor allows to create effective multiprocessor systems on its basis. A system consisting of S neural processors will emulate a neural network layer S times faster than one neural processor. In the extreme case every fragment of every layer may be emulated by a separate neural processor.

[0094] The main unit of the neural processor is the calculation unit 18.

[0095] Fig.4 discloses a block diagram of one of the possible implementations of the calculation unit 18 for execution of operations over vectors of programmable word length data, comprising a shift register 50, which performs an arithmetic shift of J bits left on all N-bit vector operands stored in it, where J - the minimal value that is an aliquot part of the data word lengths in second operand vectors of the calculation unit 18, a delay element 51, a first memory block 52, containing a sequential input port and N/J cells to store N-bit data, a second memory block 53, containing N/J cells to store N-bit data, N/J multiplier blocks 54, each of which multiplies an N-bit vector of programmable word length data by a J-bit multiplier, and a vector adding circuit 55, which generates a partial product of the summation of N/J + 1 programmable word length data vectors.
[0096] Inputs of third operand vector 21 bits of calculation unit 18 are connected to data inputs of the shift register [...] 27 of the calculation unit 18, which every input of data boundaries setting for second operand vectors 23 [...] of the respective multiplier block 54. Inputs of data boundaries setting for first operand vectors and for result vectors 22 of calculation unit 18 are connected to inputs of data boundaries setting for [...] result vectors of each multiplier block 54 and to inputs of data boundaries setting for summand vectors and result vectors of the vector adding circuit 55, which outputs of bits of first and second summand vectors of results are respective outputs 28 and 29 of the calculation unit 18.

[...] stages.

[0099] At first, per N/J clock cycles, matrix Z is transformed into matrix

$$Z' = \begin{pmatrix} Z'_{1,1} & Z'_{1,2} & \ldots & Z'_{1,M} \\ Z'_{2,1} & Z'_{2,2} & \ldots & Z'_{2,M} \\ \vdots & \vdots & & \vdots \\ Z'_{N/J,1} & Z'_{N/J,2} & \ldots & Z'_{N/J,M} \end{pmatrix},$$

that is loaded to the first memory block 52 of the calculation unit 18. And the i-th row of matrix Z' is a data vector

$$Z'_i = (Z'_{i,1}\ Z'_{i,2}\ \ldots\ Z'_{i,M}).$$

Matrix Z' is formed from matrix Z by replacing the k-th row $Z_k$ (k = 1, 2, ..., K) of matrix Z with $N'_k/J$ rows of matrix Z', generated according to the expression:

$$Z'_{I_{k-1}+j} = Z_k \times 2^{J(j-1)} \quad (j = 1, 2, \ldots, N'_k/J),$$

where $I_k$ - the total number of J-bit groups of bits in the k first operands of vector Y, $N'_k$ - the word length of the k-th element $Y_k$ of vector Y:

$$I_k = \sum_{n=1}^{k} N'_n / J.$$

[0100] From the expression presented above it follows that

$$Z'_1 = Z_1, \quad Z'_{N'_1/J+1} = Z_2,$$

and so on. It means that all rows of the matrix Z will be present in the matrix Z' but, as a rule, at other positions.

[0101] Matrix Z is transformed into matrix Z' by means of the shift register 50 per N/J clock cycles. In each of these N/J clock cycles a clock signal is applied to the control input 26 of the calculation unit 18, and this clock signal supplies the clock input of the shift register 50, and the N-bit control word H described above is continuously applied to inputs 24 of the calculation unit 18, and this control word supplies inputs of data boundaries setting for operand vectors of the shift register 50. At the i-th clock cycle (i = 1, 2, ..., N/J) the i-th bit $e_i$ of the (N/J)-bit control word E described above is applied to the control input 25 of the calculation unit 18. This signal supplies the mode select input of the shift register 50.

[0102] At the $(I_{k-1}+1)$-th clock cycle (k = 1, 2, ..., K), when a bit of the word E of the value 1 is applied to the input 25 of the calculation unit 18, the shift register 50 changes its mode to load of vector $Z_k$, applied to the inputs 21 of the calculation unit 18. At each of the remaining N/J-K clock cycles, when a bit of the word E of the value 0 is applied to the input 25 of the calculation unit 18, the shift register 50 will execute an arithmetic shift of J bits left on the data vector stored in it.

[0103] Thus, when the i-th clock cycle (i = 1, 2, ..., N/J) of the process of transforming matrix Z into matrix Z' is finished, vector $Z'_i$ will be stored in the shift register 50. Data from the outputs of the shift register 50 is applied to data inputs of the first memory block 52, containing a sequential input port.
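The load-or-shift schedule of the shift register can be sketched in Python as follows. This is an element-level illustration only: rows are lists of plain integers, the left shift is not truncated to the element word length (an assumption made for simplicity), and the function name `expand_matrix` is hypothetical.

```python
def expand_matrix(z_rows, y_word_lengths, j_bits):
    """Emulate the shift register schedule that turns Z into Z'.

    z_rows         -- rows Z_1..Z_K, each a list of element values
    y_word_lengths -- word lengths of Y_1..Y_K (multiples of j_bits)
    j_bits         -- shift amount J
    Returns (e_bits, z_prime): the control word E and the N/J rows of Z'.
    """
    # E has a 1 at the first J-bit group of every element of Y.
    e_bits = []
    for n_k in y_word_lengths:
        e_bits += [1] + [0] * (n_k // j_bits - 1)

    z_prime = []
    register = None
    pending = iter(z_rows)
    for e in e_bits:
        if e == 1:
            register = list(next(pending))               # load mode: take Z_k
        else:
            register = [z << j_bits for z in register]   # shift left by J bits
        z_prime.append(register)                         # written to memory block
    return e_bits, z_prime


e_bits, z_prime = expand_matrix([[1, 2], [3, -1]], [4, 2], j_bits=2)
print(e_bits)   # [1, 0, 1]
print(z_prime)  # [[1, 2], [4, 8], [3, -1]]
```

The first row of Z occupies two rows of Z' because its multiplier element is 4 bits wide, while the second row appears once.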

[0104] The clock signal, applied to the input 26 of the calculation unit 18 at each clock cycle during the whole process of transforming matrix Z into matrix Z', passes through the delay element 51, which may be a usual inverter gate, to the write control input of the first memory block 52 of the calculation unit 18. So the loading of matrix Z' to the first memory block 52 of the calculation unit 18 will take place simultaneously with the transformation of matrix Z into matrix Z'. At the end of the loading process vector $Z'_i$ (i = 1, 2, ..., N/J) will be stored in the i-th cell of the first memory block 52 of the calculation unit 18.

[0105] After that a clock signal is applied to the control input 27 of the calculation unit 18 for one clock cycle, and due to this signal the contents of all cells of the first memory block 52 is rewritten to the corresponding cells of the second memory block 53 of the calculation unit 18. Thus, matrix Z' is moved from the first 52 to the second 53 memory block of the calculation unit 18 per one clock cycle.

[0106] Starting from the next clock cycle the executive units of the calculation unit 18, which are multiplier blocks 54 and the vector adding circuit 55, will generate a partial product of the operation

$$X + Y \times Z$$

in each clock cycle. And the i-th multiplier block 54 is used for generating a partial product of the multiplication of vector $Z'_i$, stored in the i-th cell of the second memory block 53 of the calculation unit 18, by the i-th group of bits $y_i$ of vector Y, applied to the inputs 20 of the calculation unit 18:

$$P_i = Z'_i \times y_i.$$

[0107] The control word E is applied to the inputs 23 of the calculation unit 18, and the j-th bit $e_j$ of this word supplies the sign correction input of the (j-1)-th multiplier block 54 (j = 2, 3, ..., N/J). The least significant bit $e_1$ of the control word E is applied to the sign correction input of the (N/J)-th multiplier block 54. So each multiplier block 54, where a group of most significant bits of one of the elements of vector Y is applied to the inputs of multiplier bits, will perform the multiplication in two's complement presentation. The remaining N/J-K multiplier blocks 54 will operate in sign-and-magnitude presentation.

[0108] The vector adding circuit 55 generates a partial product of the summation of partial products $P_1, P_2, \ldots, P_{N/J}$ and of the vector X, applied to inputs 19 of the calculation unit 18. This circuit may be designed on the basis of carry save adders.
[0109] The control word H is applied to the inputs 22 of the calculation unit 18, and this control word supplies the inputs of data boundaries setting for multiplicand vectors of every multiplier block 54 and the inputs of data boundaries setting for summand vectors of the vector adding circuit 55. In this case in each executive unit of the calculation unit 18 the carry propagation between the bits of these units, which process different elements of input vectors, will be locked.

[0110] At the outputs of the vector adding circuit 55 vectors A and B are generated and their sum is equal to

$$A + B = X + \sum_{i=1}^{N/J} P_i = X + \sum_{i=1}^{N/J} y_i \times Z'_i.$$

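[0111] Having grouped the partial products referring to separate elements of vector Y, the last expression can be presented in the following form:

$$A + B = X + \sum_{k=1}^{K} \sum_{j=1}^{N'_k/J} y_{I_{k-1}+j} \times Z'_{I_{k-1}+j} = X + \sum_{k=1}^{K} Z_k \times \sum_{j=1}^{N'_k/J} y_{I_{k-1}+j} \times 2^{J(j-1)}.$$

[0112] Taking into account the fact that every k-th element of vector Y is equal to

$$Y_k = \sum_{j=1}^{N'_k/J} y_{I_{k-1}+j} \times 2^{J(j-1)},$$

the previous expression will be transformed as follows:

$$A + B = X + \sum_{k=1}^{K} Y_k \times Z_k.$$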

[0113] Thus, a partial product of the operation

$$X + Y \times Z$$

is generated at the outputs 28 and 29 of the calculation unit.
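The decomposition of the multiplication into J-bit partial products can be checked numerically with a short Python sketch. It works on element values, returns the sum A + B rather than the two separate carry-save vectors, and reads only the most significant group of each element of Y as a signed number. The function names are hypothetical.

```python
def split_groups(y, n_bits, j_bits):
    """Split a signed n_bits-wide value into J-bit groups, low group first.

    Every group is unsigned except the most significant one, which is
    read as a two's complement number (the sign-corrected multiplier block).
    """
    raw = y & ((1 << n_bits) - 1)
    mask = (1 << j_bits) - 1
    groups = [(raw >> shift) & mask for shift in range(0, n_bits, j_bits)]
    if groups[-1] >= 1 << (j_bits - 1):
        groups[-1] -= 1 << j_bits
    return groups


def multiply_accumulate(x, y, y_word_lengths, z_rows, j_bits):
    """Compute X + Y x Z as X plus the sum of partial products y_i x Z'_i."""
    total = list(x)
    for y_k, n_k, z_k in zip(y, y_word_lengths, z_rows):
        for j, group in enumerate(split_groups(y_k, n_k, j_bits)):
            z_prime_row = [z << (j_bits * j) for z in z_k]   # row of Z'
            total = [t + group * z for t, z in zip(total, z_prime_row)]
    return total


x = [10, -3]
y = [-3, 5]
z = [[2, 1], [-1, 4]]
result = multiply_accumulate(x, y, [4, 4], z, j_bits=2)
direct = [x[m] + sum(y[k] * z[k][m] for k in range(2)) for m in range(2)]
print(result, direct)  # [-1, 14] [-1, 14]
```

The grouped sum matches the direct evaluation, which is the identity that the derivation above establishes.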

[0114] In the general case the clock cycle time is defined by the total propagation delay of the successively connected switch from 3 to 2 11, the saturation unit 14, the calculation unit 18 and the adder circuit 30. The neural processor performance can be essentially increased by using saturation units 13 and 14 comprising input data registers, which data inputs are connected to inputs 15 of these units, the calculation unit 18 comprising an input data register, which data inputs are connected to inputs 19 and 20 of the calculation unit, and the adder circuit 30 comprising an input data register, which data inputs are connected to inputs 31, 32 and 33 of the adder circuit. The presence of such registers in the neural processor executive units allows to process data in the pipeline mode, which provides parallel execution of the following three processes in each clock cycle: generation by the calculation unit 18 of the partial product of the weighted summation of the current data set, addition of the partial product of the weighted summation of the previous data set on the adder circuit 30, and calculation of saturation functions for the next set of input operands on the units 13 and 14. As the maximal propagation delays of the saturation units 13 and 14, of the calculation unit 18 and of the adder circuit 30 have approximately equal values, the incorporation of the pipeline registers allows to increase the neural processor clock rate practically by three times.
[0115] The saturation unit, which block diagram is presented in Fig.5, has inputs of input operand vector 15 bits, control inputs 16 and outputs of result vector 17 bits. Each of N bits 56 of said unit comprises first 57 and second 58 multiplexers, an EXCLUSIVE OR gate 59, an EQUIVALENCE gate 60, a NAND gate 61 and an AND gate with inverted input 62. Said unit includes also a carry propagation circuit 63, comprising an initial carry input 64, inverted inputs of the carry propagation through separate bits 65, carry inputs from separate bits 66 and outputs of the carry to separate bits 67, and a carry look-ahead circuit 68, comprising an initial carry input 69, inputs of the carry propagation through separate bits 70, carry generation inputs of separate bits 71 and outputs of the carry to separate bits 72.
[0116] Second data inputs of the first 57 and second 58 multiplexers and first input of the EXCLUSIVE OR gate 59 of each bit 56 of said unit are coupled and connected to input of the respective bit of input operand vector 15 of said unit, which output of each bit of result vector 17 is connected to output of the first multiplexer 57 of the respective bit 56 of said unit. Non inverted input of the AND gate with inverted input 62 and first inputs of the NAND gate 61 and the EQUIVALENCE gate 60 of each bit 56 of said unit are coupled and connected to the respective control input 16 of said unit. First input of the EXCLUSIVE OR gate 59 and non inverted input of the AND gate with inverted input 62 of q-th bit 56 of said unit are respectively connected to second input of the EXCLUSIVE OR gate 59 and to inverted input of the AND gate with inverted input 62 of (q-1)-th bit of said unit, first data input of the second multiplexer 58 of which is connected to output of the carry to (N-q+2)-th bit 67 of the carry propagation circuit 63 (where q = 2, 3, ..., N). Output of the NAND gate 61 of n-th bit 56 of said unit is connected to input of carry propagation through (N-n+1)-th bit 70 of the carry look-ahead circuit 68, which output of the carry to (N-n+2)-th bit 72 is connected to control input of the first multiplexer 57 of n-th bit 56 of said unit, output of the AND gate with inverted input 62 of which is connected to control input of the second multiplexer 58 of the same bit 56 of said unit, to carry generation input of (N-n+1)-th bit 71 of the carry look-ahead circuit 68 and to inverted input of the carry propagation through (N-n+1)-th bit 65 of the carry propagation circuit 63, which carry input from (N-n+1)-th bit 66 is connected to output of the second multiplexer 58 of n-th bit 56 of said unit (where n = 1, 2, ..., N). In each bit 56 of said unit output of the second multiplexer 58 is connected to second input of the EQUIVALENCE gate 60, which output is connected to first data input of the first multiplexer 57, and output of the EXCLUSIVE OR gate 59 is connected to second input of the NAND gate 61. Second input of the EXCLUSIVE OR gate 59, inverted input of the AND gate with inverted input 62 and first data input of the second multiplexer 58 of N-th bit 56 of said unit, initial carry input 64 of the carry propagation circuit 63 and initial carry input 69 of the carry look-ahead circuit are coupled and connected to "0".

[0117] As circuits 63 and 68 in the saturation unit various carry propagation circuits and carry look-ahead circuits, 
applied in parallel adders, may be used. 

[0118] In the simplest variant of carry propagation circuit 63 implementation, output of the carry to q-th bit 67 is connected to carry input from (q-1)-th bit 66 (where q = 1, 2, ..., N).

[0119] The saturation unit operates as follows.

[0120] Bits of the input operand vector

$$D = (D_1\ D_2\ \ldots\ D_L)$$

are applied to the inputs 15 of said unit. Vector D is an N-bit word of L packed data in two's complement presentation, which are elements of this vector. And the least significant bits of vector D are bits of the first datum $D_1$, then bits of the second datum $D_2$ follow, etc. The most significant bits of vector D are bits of the L-th datum $D_L$. With such packing the ν-th bit of the λ-th datum $D_\lambda$ is the

$$\left(\nu + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

bit of vector D, where $N_\lambda$ - the word length of the λ-th datum $D_\lambda$ of vector D, ν = 1, 2, ..., $N_\lambda$, λ = 1, 2, ..., L.
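The packing rule can be illustrated with a minimal Python sketch. The helper names `pack` and `bit_position` are hypothetical; bit positions are counted from 1, as in the text, with position 1 being the least significant bit of the word.

```python
def pack(values, word_lengths):
    """Pack signed values into one word, first datum in the low bits."""
    word, offset = 0, 0
    for value, n_len in zip(values, word_lengths):
        word |= (value & ((1 << n_len) - 1)) << offset   # two's complement field
        offset += n_len
    return word


def bit_position(v, lam, word_lengths):
    """1-based position in D of the v-th bit of the lam-th datum."""
    return v + sum(word_lengths[:lam - 1])


lengths = [4, 4, 8]                       # total word length N = 16
word = pack([3, -2, 5], lengths)
pos = bit_position(1, 3, lengths)         # lowest bit of the third datum
print(hex(word), pos, (word >> (pos - 1)) & 1)  # 0x5e3 9 1
```

The third datum (5) starts at bit 9 of the word, and its lowest bit is 1, as the extracted bit confirms.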

[0121] The minimal word length of data, composing vector D, is equal to 2. In the general case the number of bits $N_\lambda$ in the λ-th datum $D_\lambda$ of vector D may be any integer value from 2 to N (λ = 1, 2, ..., L), and the number of data L, packed in this vector, - from 1 to N/2. The only restriction is that the total word length of all data, packed in one vector D, should be equal to its word length:

$$\sum_{\lambda=1}^{L} N_\lambda = N.$$

[0122] The unit is aimed to generate on the outputs 17 vector

$$F = (F_1\ F_2\ \ldots\ F_L),$$

which the λ-th element $F_\lambda$ is the result of calculation of the saturation function of the λ-th operand of vector D:

$$F_\lambda = \Psi_{Q_\lambda}(D_\lambda),$$

where $Q_\lambda$ - a parameter of the saturation function, calculated for the operand $D_\lambda$ (λ = 1, 2, ..., L). The general view of the saturation function, calculated by said unit, is presented in Fig.2 and may be described by the following expressions:

$$\Psi_Q(D) = D \quad \text{if } -2^Q \le D \le 2^Q - 1;$$

$$\Psi_Q(D) = 2^Q - 1 \quad \text{if } D > 2^Q - 1;$$

$$\Psi_Q(D) = -2^Q \quad \text{if } D < -2^Q.$$

[0123] Vector F has the same format as that of vector D. The number of significant bits in the element $F_\lambda$ of vector F, without taking the sign bit into account, is equal to the value of the parameter $Q_\lambda$, which should be less than the word length $N_\lambda$ of the operands $D_\lambda$ and $F_\lambda$ (λ = 1, 2, ..., L).

[0124] Tuning of the hardware of said unit for the required format of vectors D and F and for the required values of parameters of the implemented saturation functions is made by means of setting an N-bit control word U to control inputs 16.

[0125] And the bits of word U should have the following values: bits from the first to the $Q_1$-th are the value 0 each, bits from the $(Q_1 + 1)$-th to the $N_1$-th are the value 1 each, bits from the $(N_1 + 1)$-th to the $(N_1 + Q_2)$-th are the value 0 each, bits from the $(N_1 + Q_2 + 1)$-th to the $(N_1 + N_2)$-th are the value 1 each, etc. In the general case the bits of word U from the

$$\left(1 + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

to the

$$\left(Q_\lambda + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

should be the value 0 each and bits from the

$$\left(1 + Q_\lambda + \sum_{\mu=1}^{\lambda-1} N_\mu\right)\text{-th}$$

to the

$$\left(\sum_{\mu=1}^{\lambda} N_\mu\right)\text{-th}$$

should be the value 1 each (λ = 1, 2, ..., L).

[0126] If the value of the n-th bit of the word U is equal to 1 ($u_n = 1$) and the value of the (n+1)-th bit is equal to 0 ($u_{n+1} = 0$), then said unit will regard the n-th bit of vector D as the most significant (sign) bit of the corresponding element of this vector. The number of zero bits in the word U is equal to the total number of significant bits in all the elements of the vector of results F.

[0127] The AND gate with inverted input 62 of the n-th bit 56 of said unit generates the signal $g_n = \overline{u_{n+1}} \wedge u_n$, which is the indicator that the n-th bit of said unit processes the sign bit of one of the input operands composing vector D (hereafter n = 1, 2, ..., N). The second multiplexer 58 of the n-th bit 56 of said unit generates the signal $v_n = v_{n+1} \wedge \overline{g_n} \vee d_n \wedge g_n$, which has the value of the sign (most significant) bit of the input operand, which bit is the n-th bit $d_n$ of vector D.

[0128] In order to accelerate the generation of signals $v_n$ for all bits 56 of said unit the carry propagation circuit 63 is used, which may be any known circuit with sequential or look-ahead carry, applied in usual parallel adders. It is characteristic of the carry propagation circuit 63 in the offered unit that signals $v_n$ are used as input and output carry signals and inverted values of signals $g_n$ are used as signals of carry propagation through separate bits. In this case the carry is propagated from the most significant bits of said unit to the least significant ones.

[0129] The EXCLUSIVE OR gate 59 and the NAND gate 61 of the n-th bit 56 of said unit are used to generate the signal $p_n = \overline{u_n} \vee \overline{d_{n+1} \oplus d_n}$, which is the indicator that the value of the n-th bit $d_n$ of vector D does not lead to exceeding the saturation region determined by the word U for the input operand, which bit is the n-th bit $d_n$ of vector D.

[0130] The carry look-ahead circuit 68 generates for every n-th bit 56 of said unit the signal $c_n = c_{n+1} \wedge p_n \vee g_n$, which is the indicator that the values of all bits of vector D from the n-th bit $d_n$ and up to the most significant bit of the input operand, which bit is the n-th bit $d_n$ of vector D, do not lead to exceeding the saturation region determined by the word U for this input operand. Any known sequential or group carry generation circuit, applied in usual parallel adders, may be used as the circuit 68. It is characteristic of the carry look-ahead circuit 68 in the offered unit that signals $g_n$ are used as carry generation signals, applied to inputs 71, signals $p_n$ are used as carry propagation signals, applied to inputs 70, and signals $c_n$ are generated at the carry outputs 72. In this case the carry is propagated from the most significant bits of said circuit to the least significant ones.

[0131] The EQUIVALENCE gate 60 and the first multiplexer 57 of the n-th bit 56 of said unit generate the value of the n-th bit $f_n$ of the result vector F according to the expression $f_n = d_n \wedge c_n \vee \overline{(v_n \oplus u_n)} \wedge \overline{c_n}$. If $c_n = 1$, then at the output of the first multiplexer 57 the value of the bit $d_n$ of vector D is set; if $c_n = 0$ and $u_n = 1$, then at the output of the first multiplexer 57 the non inverted value of the sign bit ($v_n$) of the corresponding operand of vector D is set; if $c_n = 0$ and $u_n = 0$, then at the output of the first multiplexer 57 the inverted value of the sign bit ($\overline{v_n}$) of the corresponding operand of vector D is set. The result vector bits, obtained at the outputs of the first multiplexers 57, are supplied to the outputs 17 of said unit.
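The bit-level behaviour of the saturation unit can be simulated with a short Python sketch. It evaluates the signals g, v, p, c and f sequentially from the most significant bit to the least significant one, which corresponds to the simplest ripple variant of circuits 63 and 68 rather than a look-ahead implementation. Bit lists are indexed from 1 (index 0 is unused), and all function names are hypothetical.

```python
def control_word(formats):
    """Build U from (N_lambda, Q_lambda) pairs, first element in low bits."""
    u = [None]
    for n_len, q in formats:
        u += [0] * q + [1] * (n_len - q)
    return u


def to_bits(values, formats):
    """Pack signed values into a 1-indexed list of bits."""
    bits = [None]
    for value, (n_len, _) in zip(values, formats):
        bits += [(value >> k) & 1 for k in range(n_len)]
    return bits


def from_bits(bits, formats):
    """Unpack a 1-indexed bit list into signed values."""
    values, pos = [], 1
    for n_len, _ in formats:
        raw = sum(bits[pos + k] << k for k in range(n_len))
        values.append(raw - (1 << n_len) if bits[pos + n_len - 1] else raw)
        pos += n_len
    return values


def saturation_unit(d_bits, u_bits):
    """Evaluate the per-bit logic of the saturation unit."""
    n = len(u_bits) - 1
    d = d_bits + [0]                      # XOR input of the N-th bit tied to "0"
    u = u_bits + [0]                      # inverted AND input of the N-th bit
    v_next, c_next = 0, 0                 # initial carries 64 and 69 tied to "0"
    f = [None] * (n + 1)
    for i in range(n, 0, -1):
        g = u[i] & (1 - u[i + 1])         # AND gate 62: sign-bit indicator
        v = d[i] if g else v_next         # multiplexer 58: sign of the operand
        p = 1 - (u[i] & (d[i + 1] ^ d[i]))   # XOR 59 and NAND 61
        c = (c_next & p) | g              # carry look-ahead circuit 68
        f[i] = d[i] if c else 1 - (v ^ u[i])  # multiplexer 57, EQUIVALENCE 60
        v_next, c_next = v, c
    return f


formats = [(8, 3), (4, 3), (4, 1)]        # three elements, N = 16
d = to_bits([100, -5, -8], formats)
saturated = from_bits(saturation_unit(d, control_word(formats)), formats)
print(saturated)  # [7, -5, -2]
```

The first element exceeds the range for Q = 3 and is clamped to 7, the second lies inside its range and passes unchanged, and the third is clamped to -2, the lower bound for Q = 1.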

[0132] It is necessary to note that if the control word U=(100...0)b is applied to the inputs 16 of said unit then the data from inputs 15 of said unit will pass to its outputs 17 without changes.

[0133] Thus, the offered saturation unit has a propagation delay approximately equal to the propagation delay of a usual parallel adder of two N-bit numbers. In this case said unit allows to simultaneously calculate saturation functions for several data, which word length may be programmed by the user.

[0134] The calculation unit, which block diagram is presented in Fig. 7, comprises inputs of first 19, second 20 and third 21 operand vector bits, inputs of boundary setting for first operand vectors and result vectors 22, for second operand vectors 23 and for third operand vectors 24, first 25 and second 26 inputs of load control of third operand vectors into the first memory block, input of reload control of third operand matrix from the first memory block to the second memory block 27 and outputs of bits of first summand vector of results 28 and of second summand vector of results 29. Said unit includes a shift register 50, a delay element 51, N/2 AND gates with inverted input 75, N/2 decoders of multiplier bits 76, a multiplier array 77 of N columns by N/2 cells in each column. Any bit of the shift register 50 consists of an AND gate with inverted inputs 78, a multiplexer 79 and a trigger 80. Each cell of the multiplier array 77 consists of first 81 and second 82 triggers, functioning as memory cells of respectively first and second memory blocks of said unit, an AND gate with inverted input 83, a one-bit partial product generation circuit 84, a one-bit adder 85 and a multiplexer 86. In Fig. 7 the columns of cells of the multiplier array 77 are numbered from right to left, and the cells of columns of the multiplier array 77 - from top downward.

[0135] The input of each bit of first operand vector 19 of said unit is connected to the second input of the one-bit
adder 85 of the first cell of the respective column of the multiplier array 77. The first input of the one-bit adder 85
of each cell is connected to the output of the one-bit partial product generation circuit 84 of the same cell of the
multiplier array 77. The control inputs of the multiplexers 86 and the inverted inputs of the AND gates with inverted
input 83 of all cells of each column are coupled and connected to the respective input of data boundaries setting for
first operand vectors and for result vectors 22 of said unit. Each input of data boundaries setting for second operand
vectors 23 of said unit is connected to the inverted input of the respective AND gate with inverted input 75, whose
output is connected to the first input of the respective decoder of multiplier bits 76. The respective control inputs of
the one-bit partial product generation circuits 84 of the i-th cells of all columns of the multiplier array 77 are
coupled and connected to the respective outputs of the i-th decoder of multiplier bits 76, whose second and third inputs
are connected to the inputs of respectively the (2i-1)-th and (2i)-th bits of second operand vector 20 of said unit
(where i = 1, 2, ..., N/2). The non-inverted input of the j-th AND gate with inverted input 75 is connected to the third
input of the (j-1)-th decoder of multiplier bits 76 (where j = 2, 3, ..., N/2). The input of each bit of third operand
vector 21 of said unit is connected to the second data input of the multiplexer 79 of the respective bit of the shift
register 50, whose first data input is connected to the output of the AND gate with inverted inputs 78 of the same bit
of the shift register 50, whose first inverted input is connected to the respective input of data boundaries setting for
third operand vectors 24 of said unit. The second inverted input of the AND gate with inverted inputs 78 of the q-th bit
of the shift register 50 is connected to the first inverted input of the AND gate with inverted inputs 78 of the
(q-1)-th bit of the shift register 50 (where q = 2, 3, ..., N). The non-inverted input of the AND gate with inverted
inputs 78 of the r-th bit of the shift register 50 is connected to the trigger 80 output of the (r-2)-th bit of the
shift register 50 (where r = 3, 4, ..., N). The control inputs of the multiplexers 79 of all shift register 50 bits are
coupled and connected to the first input of load control of third operand vectors into the first memory block 25 of said
unit. The clock inputs of the triggers 80 of all shift register 50 bits and the input of the delay element 51 are
coupled and connected to the second input of load control of third operand vectors into the first memory block 26 of
said unit. The output of the multiplexer 79 of each shift register 50 bit is connected to the data input of the trigger
80 of the same bit of the shift register 50, whose output is connected to the data input of the first trigger 81 of the
last cell of the respective column of the multiplier array 77. The output of the first trigger 81 of the j-th cell of
each multiplier array 77 column is connected to the data input of the first trigger 81 of the (j-1)-th cell of the same
multiplier array 77 column (where j = 2, 3, ..., N/2). The clock inputs of the first triggers 81 of all multiplier array
77 cells are coupled and connected to the output of the delay element 51. The clock inputs of the second triggers 82 of
all multiplier array 77 cells are coupled and connected to the input of reload control of third operand matrix from the
first memory block to the second memory block 27 of said unit. The second data input of the one-bit partial product
generation circuit 84 of the i-th cell of the q-th multiplier array 77 column is connected to the output of the AND gate
with inverted input 83 of the i-th cell of the (q-1)-th multiplier array 77 column (where i = 1, 2, ..., N/2 and
q = 2, 3, ..., N). The second input of the one-bit adder 85 of the j-th cell of each multiplier array 77 column is
connected to the sum output of the one-bit adder 85 of the (j-1)-th cell of the same multiplier array 77 column (where
j = 2, 3, ..., N/2). The third input of the one-bit adder 85 of the j-th cell of the q-th multiplier array 77 column is
connected to the output of the multiplexer 86 of the (j-1)-th cell of the (q-1)-th multiplier array 77 column (where
j = 2, 3, ..., N/2 and q = 2, 3, ..., N), and the third input of the one-bit adder 85 of the j-th cell of the first
multiplier array 77 column is connected to the third output of the (j-1)-th decoder of multiplier bits 76 (where
j = 2, 3, ..., N/2).

[0136] The sum output of the one-bit adder 85 of the last cell of each multiplier array 77 column is the output of the
respective bit of first summand vector of results 28 of said unit. The output of the multiplexer 86 of the last cell of
the (q-1)-th multiplier array 77 column is the output of the q-th bit of second summand vector of results 29 of said
unit (where q = 2, 3, ..., N), and the first bit of second summand vector of results 29 is connected to the third output
of the (N/2)-th decoder of multiplier bits 76. The second inverted and non-inverted inputs of the AND gate with inverted
inputs 78 of the first bit and the non-inverted input of the AND gate with inverted inputs 78 of the second bit of the
shift register, the second data inputs of the one-bit partial product generation circuits 84 of all cells of the first
column of the multiplier array 77, the third inputs of the one-bit adders 85 of the first cells of all multiplier array
77 columns and the non-inverted input of the first AND gate with inverted input 75 are coupled and connected to "0". In
each multiplier array 77 cell the output of the first trigger 81 is connected to the data input of the second trigger
82, whose output is connected to the non-inverted input of the AND gate with inverted input 83 and to the first data
input of the one-bit partial product generation circuit 84, whose third control input is connected to the second data
input of the multiplexer 86, whose first data input is connected to the carry output of the one-bit adder 85 of the same
cell of the multiplier array 77.

[0137] The calculation unit is used to generate a product of the multiplication of the second operand vector

$$Y = (Y_1\ Y_2\ \ldots\ Y_K),$$

whose bits are supplied to inputs 20 of said unit, by the third operand matrix

$$Z = \begin{pmatrix} Z_{1,1} & Z_{1,2} & \cdots & Z_{1,M} \\ Z_{2,1} & Z_{2,2} & \cdots & Z_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ Z_{K,1} & Z_{K,2} & \cdots & Z_{K,M} \end{pmatrix},$$

previously loaded and stored in the second memory block of said unit, with addition to the obtained product of the first
operand vector

$$X = (X_1\ X_2\ \ldots\ X_M),$$

whose bits are supplied to inputs 19 of said unit. In each clock cycle, bits of the vectors

$$A = (A_1\ A_2\ \ldots\ A_M) \quad \text{and} \quad B = (B_1\ B_2\ \ldots\ B_M)$$

are generated at outputs 28 and 29 of said unit, and their sum is the result of the operation

$$X + Y \times Z.$$

That is, the sum of the m-th elements of vectors A and B is defined by the following expression:

$$A_m + B_m = X_m + \sum_{k=1}^{K} Y_k \times Z_{k,m} \quad (m = 1, 2, \ldots, M).$$



[0138] Vector X is an N-bit word of M packed data in two's complement representation, which are the elements of this
vector. The least significant bits of vector X are the bits of the first datum $X_1$, followed by the bits of the second
datum $X_2$, and so on. The most significant bits of vector X are the bits of the M-th datum $X_M$. With such packing
the v-th bit of the m-th datum $X_m$ is the

$$\left(v + \sum_{\mu=1}^{m-1} N_\mu\right)\text{-th}$$

bit of vector X, where $N_m$ is the word length of the m-th datum $X_m$ of vector X, v = 1, 2, ..., $N_m$, and
m = 1, 2, ..., M. The number of data M in vector X and the number of bits $N_m$ in each m-th datum $X_m$ of this vector
(m = 1, 2, ..., M) may be any integer value from 1 to N. The only restriction is that the total word length of all the
data packed in one vector X should be equal to its word length:

$$\sum_{m=1}^{M} N_m = N.$$
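The packing rule of paragraph [0138] can be illustrated with a small software sketch. The helper names `pack`, `unpack` and `bit_position` are assumptions introduced here for illustration only and do not appear in the described hardware; bit 1 of the vector is modelled as bit 0 of a Python integer.

```python
def pack(values, widths):
    """Pack signed integers into one word; element 1 occupies the least significant bits."""
    word, offset = 0, 0
    for value, width in zip(values, widths):
        # Two's complement representation truncated to `width` bits.
        word |= (value & ((1 << width) - 1)) << offset
        offset += width
    return word


def unpack(word, widths):
    """Inverse of pack: recover the signed elements from the packed word."""
    values, offset = [], 0
    for width in widths:
        field = (word >> offset) & ((1 << width) - 1)
        # Reinterpret the field as a signed two's complement number.
        if field >> (width - 1):
            field -= 1 << width
        values.append(field)
        offset += width
    return values


def bit_position(v, m, widths):
    """1-based position in the vector of the v-th bit of the m-th element."""
    return v + sum(widths[: m - 1])


widths = [3, 5]             # N = 8, M = 2; the word lengths sum to N
x = pack([3, -7], widths)   # 3 -> 011, -7 -> 11001
```

With these word lengths the first bit of the second element is bit 4 of the vector, in agreement with the position formula above.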



[0139] Vector Y is an N-bit word of K packed data in two's complement representation, which are the elements of this
vector. The format of vector Y is the same as that of vector X. However, these vectors may differ in the number of
elements and in the word length of the separate data packed in them. The number of bits $N'_k$ in the k-th datum $Y_k$
(k = 1, 2, ..., K) of vector Y may be any integer value from 2 to N. The number of data K in vector Y may be any integer
value from 1 to N/2. However, the total word length of all the data packed in one vector Y should be equal to its word
length:

$$\sum_{k=1}^{K} N'_k = N.$$

[0140] The k-th row of matrix Z is a data vector

$$Z_k = (Z_{k,1}\ Z_{k,2}\ \ldots\ Z_{k,M}),$$

where k = 1, 2, ..., K. Each of the vectors $Z_1, Z_2, \ldots, Z_K$ should have the same format as that of vector X.

[0141] Vectors A and B, generated at the outputs 28 and 29 of said unit, have the same format as that of vector X. 
[0142] The calculation unit hardware is tuned to process vectors of the required formats by loading the N-bit control
word H to the inputs of data boundaries setting for first operand vectors and result vectors 22 of said unit and the
(N/2)-bit control word E to the inputs of data boundaries setting for second operand vectors 23 of said unit.

[0143] The value 1 of the n-th bit $h_n$ of the word H means that said unit will regard the n-th bit of each of the
vectors $X, Z_1, Z_2, \ldots, Z_K$ as the most significant bit of the respective element of this vector. The number of
bits with the value 1 in the word H is equal to the number of elements in each of the vectors $X, Z_1, Z_2, \ldots, Z_K$:

$$\sum_{n=1}^{N} h_n = M.$$

[0144] The value 1 of the i-th bit $e_i$ of the word E means that said unit will regard the i-th pair of bits of vector
Y as the group of least significant bits of the respective element of this vector. The number of bits with the value 1
in the word E is equal to the number of elements in vector Y:

$$\sum_{i=1}^{N/2} e_i = K.$$
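As an illustration of paragraphs [0142]-[0144], the control words can be derived from the word lengths of the packed elements. This is only a sketch: the function names are hypothetical, bit 1 of each control word is modelled as bit 0 of a Python integer, and the word lengths of vector Y are assumed to be even so that every element occupies a whole number of bit pairs.

```python
def control_word_h(x_widths):
    """N-bit word H: bit n is 1 where bit n is the most significant bit of an element."""
    h, offset = 0, 0
    for width in x_widths:
        offset += width
        h |= 1 << (offset - 1)
    return h


def control_word_e(y_widths):
    """(N/2)-bit word E: bit i is 1 where the i-th bit pair starts an element of Y."""
    e, pair = 0, 0
    for width in y_widths:
        if width % 2:
            # Assumption of this sketch, not a statement taken from the text.
            raise ValueError("this sketch assumes even word lengths in vector Y")
        e |= 1 << pair
        pair += width // 2
    return e


h = control_word_h([3, 5])     # elements of 3 and 5 bits, N = 8
e = control_word_e([4, 2, 2])  # elements of 4, 2 and 2 bits, N = 8
```

The number of ones in `h` equals M and the number of ones in `e` equals K, as the two sums above require.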



[0145] Before the operation described above can be executed, the procedure of loading the matrix Z into the second
memory block of said unit, whose memory cells are implemented by the second triggers 82 of the multiplier array 77
cells, must be carried out. Said procedure is executed in two stages.
[0146] First, over N/2 clock cycles, matrix Z is transformed into the matrix

$$Z' = \begin{pmatrix} Z'_{1,1} & Z'_{1,2} & \cdots & Z'_{1,M} \\ Z'_{2,1} & Z'_{2,2} & \cdots & Z'_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ Z'_{N/2,1} & Z'_{N/2,2} & \cdots & Z'_{N/2,M} \end{pmatrix},$$

which is loaded into the first memory block of said unit. The i-th row of matrix Z' is the data vector

$$Z'_i = (Z'_{i,1}\ Z'_{i,2}\ \ldots\ Z'_{i,M}),$$

which will then be multiplied by the i-th pair of bits of vector Y (i = 1, 2, ..., N/2). All vectors
$Z'_1, Z'_2, \ldots, Z'_{N/2}$ have the same format as any of the vectors $Z_1, Z_2, \ldots, Z_K$. Matrix Z is
transformed into matrix Z' by replacing the k-th row $Z_k$ (k = 1, 2, ..., K) of matrix Z with the $N'_k/2$ rows

$$Z'_{I_{k-1}+1},\ Z'_{I_{k-1}+2},\ \ldots,\ Z'_{I_{k-1}+N'_k/2}$$

of matrix Z', generated according to the expression

$$Z'_{I_{k-1}+j} = Z_k \times 2^{2(j-1)} \quad (j = 1, 2, \ldots, N'_k/2),$$

where $I_k$ is the total number of pairs of bits in the first k operands of vector Y, which is equal to

$$I_k = \frac{1}{2} \sum_{\sigma=1}^{k} N'_\sigma.$$

[0147] From the expression presented above it follows that

$$Z'_1 = Z_1, \quad Z'_{N'_1/2+1} = Z_2, \quad Z'_{(N'_1+N'_2)/2+1} = Z_3,$$

and so on. That is, all rows of matrix Z will be present in matrix Z', but, as a rule, at other positions.

[0148] Matrix Z is transformed into matrix Z' by means of the shift register 50, which has two operating modes. In the
load mode the value 1 is applied to the control input 25 of said unit, and all multiplexers 79 of the shift register 50
pass the data vector bits applied to the inputs 21 of said unit to the data inputs of the triggers 80 of the shift
register 50. In the shift mode the value 0 is applied to the control input 25 of said unit, and all multiplexers 79 of
the shift register 50 pass data from the outputs of the corresponding AND gates with inverted inputs 78 of the shift
register 50 to the data inputs of the triggers 80 of the shift register 50. At the output of the AND gate with inverted
inputs 78 of the r-th bit (r = 3, 4, ..., N) of the shift register 50 the signal
$w_{r-2} \wedge \bar{h}_{r-1} \wedge \bar{h}_{r-2}$ is generated, where $w_{r-2}$ is the datum stored in the trigger 80
of the (r-2)-th bit of the shift register 50, and $h_r$ is the value of the r-th bit of the N-bit control word H, which
is applied to the inputs 24 of said unit and sets the data boundaries in the vectors being processed. The AND gates with
inverted inputs 78 prevent propagation of data between shift register 50 bits that store bits of different elements of
the data vector previously loaded into the shift register 50. At the outputs of the AND gates with inverted inputs 78 of
the two least significant bits of the shift register, signals of the value 0 are constantly generated, because their
non-inverted inputs are connected to "0". Thus the shift register 50, when in the shift mode, performs an arithmetic
shift of two bits to the left on the data vector stored in it, which is equivalent to multiplying this vector by four.

[0149] Matrix Z is transformed into matrix Z' over N/2 clock cycles. In each of these N/2 clock cycles a clock signal is
applied to the control input 26 of said unit and drives the clock inputs of the triggers 80 of the shift register 50.
The N-bit control word H described above is continuously applied to the inputs of data boundaries setting for third
operand vectors 24 of said unit; the same control word will be supplied to inputs 22 of said unit when the operation

X + Y × Z

is executed after matrix Z has been loaded. At the i-th clock cycle (i = 1, 2, ..., N/2) the i-th bit $e_i$ of the
(N/2)-bit control word E described above is applied to the control input 25 of said unit; the same control word will be
supplied to inputs 23 of said unit when the operation

X + Y × Z

is executed after matrix Z has been transformed and loaded.

[0150] At the $(I_{k-1}+1)$-th clock cycle (k = 1, 2, ..., K), when a bit of the word E with the value 1 is applied to
the input 25 of said unit, the bits of vector $Z_k$ are applied to the inputs 21 of said unit, and this vector is
written to the triggers 80 of the shift register 50 without changes. At each of the remaining N/2-K clock cycles, when a
bit of the word E with the value 0 is applied to the input 25 of said unit, the values of the elements of the data
vector stored in the shift register 50, increased four times, are written to the triggers 80 of the shift register 50.

[0151] Thus, when the i-th clock cycle (i = 1, 2, ..., N/2) of the transformation of matrix Z into matrix Z' is
finished, vector $Z'_i$ will be stored in the triggers 80 of the shift register 50.
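The load and shift modes described in paragraphs [0148]-[0151] can be modelled in software as follows. This is a behavioural sketch, not the circuit itself: the function names are hypothetical, bit 1 of each word is modelled as bit 0 of a Python integer, and the gate output for bit r is assumed to be w(r-2) AND NOT h(r-1) AND NOT h(r-2), so that a stored bit never crosses an element boundary.

```python
def shift_step(w, h, n_bits):
    """One shift-mode clock: every element is shifted left by two bits inside its own field."""
    mask = (1 << n_bits) - 1
    # A stored bit may move two positions up only if neither its own position
    # nor the next one is the most significant bit of an element.
    movable = w & ~h & ~(h >> 1)
    return (movable << 2) & mask


def build_z_prime(z_rows, e_bits, h, n_bits):
    """Replay the N/2 loading clocks and return the rows Z'_1 ... Z'_{N/2}."""
    rows, w, next_row = [], 0, iter(z_rows)
    for e_i in e_bits:
        # e_i = 1: load mode (a new row Z_k); e_i = 0: shift mode (times four).
        w = next(next_row) if e_i else shift_step(w, h, n_bits)
        rows.append(w)
    return rows


H = 0b10000100                 # elements of 3 and 5 bits, N = 8
Z1 = 0b11110001                # packed row (1, -2)
shifted = shift_step(Z1, H, 8) # packed row (4 mod 8, -8 mod 32)
```

In this example the 3-bit element 1 becomes 4 and the 5-bit element -2 becomes 24, which is -8 modulo 32, so each element is multiplied by four within its own word length.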

[0152] Data from the outputs of the shift register 50 are applied to the data inputs of the first memory block 52 of
said unit, which is implemented by the first triggers 81 of the multiplier array 77 cells. The matrix of N by N/2
triggers 81 comprises N parallel (N/2)-bit shift registers, each consisting of N/2 serially connected triggers 81
included in the cells of one of the multiplier array 77 columns. The matrix of triggers 81 can therefore be considered
as a memory block with a sequential input port, containing N/2 memory cells, each of which provides storage of an N-bit
word. The functions of the i-th cell of the first memory block are executed by the triggers 81 of the cells of the i-th
row of the multiplier array 77 (i = 1, 2, ..., N/2).

[0153] The clock signal applied to the input 26 of said unit at each clock cycle during the whole process of
transforming matrix Z into matrix Z' passes through the delay element 51, which may be a usual inverter gate, to the
clock inputs of the first triggers 81 of all multiplier array 77 cells. Loading of matrix Z' into the first memory block
of said unit therefore takes place simultaneously with the transformation of matrix Z into matrix Z'. At the end of the
loading process vector $Z'_i$ will be stored in the first triggers 81 of the i-th row of the multiplier array 77
(i = 1, 2, ..., N/2).

[0154] After that a clock signal is applied to the control input 27 of said unit for one clock cycle, and due to this
signal the contents of the first triggers 81 of all cells of the multiplier array 77 are rewritten to the second
triggers 82 of the same cells of the multiplier array 77. The matrix of N by N/2 triggers 82 can be considered as a
second memory block containing N/2 memory cells, each of which provides storage of an N-bit word. The functions of the
i-th cell of the second memory block are executed by the triggers 82 of the cells of the i-th row of the multiplier
array 77 (i = 1, 2, ..., N/2). Thus matrix Z' is moved from the first to the second memory block of said unit in one
clock cycle.

[0155] Starting from the next clock cycle, the executive units of the calculation unit, namely the AND gates with
inverted input 75, the decoders of multiplier bits 76 and, within the cells of the multiplier array 77, the AND gates
with inverted input 83, the one-bit partial product generation circuits 84, the one-bit adders 85 and the multiplexers
86, will perform in each clock cycle the operation described above:

$$A + B = X + Y \times Z.$$

[0156] In this case the i-th decoder of multiplier bits 76, the i-th AND gate with inverted input 75 and the AND gates
with inverted input 83 and circuits 84 included in the cells of the i-th row of the multiplier array 77 are used to
generate the bits of the partial product of the multiplication of vector $Z'_i$, stored in the second triggers 82 of the
cells of the i-th row of the multiplier array, by the i-th pair of bits $Y'_i$ of vector Y (hereafter
i = 1, 2, ..., N/2):

$$P_i = Z'_i \times Y'_i.$$

[0157] All partial products are calculated on the basis of the modified Booth's algorithm, according to which the values
of the 2i-th and (2i-1)-th bits of vector Y and the carry signal $c_i$ from the neighbouring low-order pair of
multiplier bits determine the value of the partial product $P_i$ as follows:

if $y_{2i}=0$, $y_{2i-1}=0$ and $c_i=0$, or $y_{2i}=1$, $y_{2i-1}=1$ and $c_i=1$, then $P_i = 0$;
if $y_{2i}=0$, $y_{2i-1}=0$ and $c_i=1$, or $y_{2i}=0$, $y_{2i-1}=1$ and $c_i=0$, then $P_i = Z'_i$;
if $y_{2i}=0$, $y_{2i-1}=1$ and $c_i=1$, then $P_i = 2 \times Z'_i$;
if $y_{2i}=1$, $y_{2i-1}=0$ and $c_i=0$, then $P_i = -2 \times Z'_i$;
if $y_{2i}=1$, $y_{2i-1}=0$ and $c_i=1$, or $y_{2i}=1$, $y_{2i-1}=1$ and $c_i=0$, then $P_i = -Z'_i$.

[0158] In usual two-operand multipliers operating on the basis of Booth's algorithm, the (2i-2)-th multiplier bit is
used as the carry signal $c_i$. In the offered unit, where the multiplicand is a vector of programmable word length
operands, the carry signal $c_i$ is generated at the output of the i-th AND gate with inverted input 75 and is described
as follows:

$$c_i = y_{2i-2} \wedge \bar{e}_i,$$

where $y_{2i-2}$ is the (2i-2)-th bit of vector Y and $e_i$ is the i-th bit of the control word E. The use of the AND
gates with inverted input 75 makes it possible to block carry propagation between two pairs of bits of vector Y that
belong to different elements of the vector.
[0159] At the outputs of the i-th decoder of multiplier bits 76 the following signals are generated:

$$one_i = y_{2i-1} \oplus c_i, \quad two_i = \overline{y_{2i-1} \oplus c_i} \wedge (y_{2i-1} \oplus y_{2i}), \quad sub_i = y_{2i}.$$

[0160] These signals control the one-bit partial product generation circuits 84 of the cells of the i-th row of the
multiplier array 77. Bits of $Z'_i$ are applied to their first data inputs from the outputs of the second triggers 82 of
the cells of the i-th row of the multiplier array 77, and bits of $Z''_i$ are applied to their second data inputs from
the outputs of the AND gates with inverted input 83 of the cells of the i-th row of the multiplier array 77. The AND
gate with inverted input 83 of the i-th cell of the n-th multiplier array 77 column generates the (n+1)-th bit
$z''_{i,n+1}$ of vector $Z''_i$ in accordance with the following expression:

$$z''_{i,n+1} = z'_{i,n} \wedge \bar{h}_n,$$

where $z'_{i,n}$ is the n-th bit of vector $Z'_i$, stored in the trigger 82 of the i-th cell of the n-th multiplier array
77 column, and $h_n$ is the n-th bit of control word H (i = 1, 2, ..., N/2 and n = 1, 2, ..., N). From said expression
it follows that vector $Z''_i$ is equal to $2 \times Z'_i$ and has the same format as that of vector $Z'_i$.

[0161] AND gates 90 and 91 and OR gate 92, which are elements of the one-bit partial product generation circuits 84 of
the cells of the i-th row of the multiplier array 77, operate as an N-bit switch. When $one_i=1$ and $two_i=0$, vector
$Z'_i$ passes to its outputs; when $one_i=0$ and $two_i=1$, vector $Z''_i$; and when $one_i=0$ and $two_i=0$, a vector
with the value 0 in each of its bits. Thus at the outputs of said switch a vector $P''_i$ is generated, which is equal
to the partial product vector $P_i$ when $sub_i=0$, and to $-P_i$ when $sub_i=1$.

[0162] Changing the sign of each element of vector $P''_i$, necessary to obtain vector $P_i$ when $sub_i=1$, may be
executed by inverting each bit of vector $P''_i$ and adding the value 1 to each element of the inverted vector. The
EXCLUSIVE OR gates 93, which are elements of the one-bit partial product generation circuits 84 of the cells of the i-th
row of the multiplier array 77, operate as inverters controlled by the signal $sub_i$. When $sub_i=0$, vector $P''_i$
passes through the EXCLUSIVE OR gates 93 to the outputs of the one-bit partial product generation circuits 84 of the
cells of the i-th row of the multiplier array 77 without changes. When $sub_i=1$, the EXCLUSIVE OR gates 93 invert each
bit of this vector. Thus at the outputs of the one-bit partial product generation circuits 84 of the cells of the i-th
row of the multiplier array 77 an N-bit vector $P'_i$ is generated, which has the same format as that of vectors
$X, Z'_1, Z'_2, \ldots, Z'_{N/2}$ and satisfies the expression:

$$P'_i + SUB_i = P_i,$$

where $SUB_i$ is an N-bit vector whose m-th element is the $N_m$-bit operand $(00 \ldots 0\, sub_i)_b$, whose least
significant bit is equal to $sub_i$ and whose remaining bits each have the value 0.
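The decoder signals and the relation P' + SUB = P can be checked with a short model of a single element. The sketch below is illustrative only: the function names are hypothetical, the decoder equations are those reconstructed from paragraphs [0157]-[0159], and arithmetic is taken modulo 2 to the power of the element word length.

```python
def carry_in(y_prev_hi, e_i):
    """Carry c_i = y_(2i-2) AND NOT e_i: blocked where the pair starts a new element."""
    return y_prev_hi & (1 - e_i)


def decode_pair(y_hi, y_lo, c):
    """Signals of one decoder of multiplier bits for bits y_2i (y_hi) and y_(2i-1) (y_lo)."""
    one = y_lo ^ c
    two = (1 - one) & (y_lo ^ y_hi)
    sub = y_hi
    return one, two, sub


def partial_product(z, width, one, two, sub):
    """Return P' for one element of `width` bits; P' + sub equals P modulo 2**width."""
    mask = (1 << width) - 1
    z &= mask
    doubled = (z << 1) & mask               # element of Z'' = 2 x Z'
    switched = z if one else doubled if two else 0
    return switched ^ (mask if sub else 0)  # XOR gates invert every bit when sub = 1


def booth_digit(y_hi, y_lo, c):
    """Value selected by the bit pair in modified Booth recoding: -2, -1, 0, 1 or 2."""
    return -2 * y_hi + y_lo + c
```

For every combination of the three decoder inputs, adding `sub` to the returned P' gives the Booth digit times z modulo the element range, which is the property the adder rows rely on.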

[0163] The one-bit adders 85 and the multiplexers 86 of the cells of the multiplier array 77 are used to generate a
partial product of the summation of vectors $X, P'_1, P'_2, \ldots, P'_{N/2}, SUB_1, SUB_2, \ldots, SUB_{N/2}$. In said
unit the summation itself is executed by the one-bit adders 85, as in usual (N/2+1)-operand summation circuits designed
on the basis of carry-save adders. The multiplexers 86 are used to replace the carry signals between columns of one-bit
adders 85 that sum different elements of the vectors with the signals $sub_1, sub_2, \ldots, sub_{N/2}$. If the (q-1)-th
bit $h_{q-1}$ of the control word H is equal to 0, the multiplexers 86 of the cells of the (q-1)-th multiplier array 77
column pass the signals from the carry outputs of the one-bit adders 85 of the cells of the (q-1)-th column of the
multiplier array 77 to the respective inputs of the one-bit adders 85 of the cells of the q-th column of the multiplier
array 77 (q = 2, 3, ..., N). If the (q-1)-th bit $h_{q-1}$ of the control word H is equal to 1, the multiplexers 86 of
the cells of the (q-1)-th multiplier array 77 column pass the signals $sub_1, sub_2, \ldots, sub_{N/2}$ from the outputs
of the decoders of multiplier bits 76 to the respective inputs of the one-bit adders 85 of the cells of the q-th column
of the multiplier array 77 (q = 2, 3, ..., N). As a result, vectors A and B are generated at the outputs 28 and 29 of
said unit, and their sum is equal to

$$A + B = X + \sum_{i=1}^{N/2} (P'_i + SUB_i) = X + \sum_{i=1}^{N/2} P_i = X + \sum_{i=1}^{N/2} Y'_i \times Z'_i.$$
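The role of the multiplexers 86 in the carry-save summation can be sketched with bitwise operations on packed words. This is a behavioural model under stated assumptions, not the gate-level circuit: the function name is hypothetical, bit 1 of each word is modelled as bit 0 of a Python integer, and each loop iteration stands for one row of one-bit adders 85.

```python
def carry_save_sum(x, p_rows, subs, h, n_bits):
    """Return (A, B) such that A + B = X + sum(P'_i + SUB_i) within every element."""
    mask = (1 << n_bits) - 1
    lsb = ((h << 1) | 1) & mask   # least significant bit of every element
    s, c_in = x, 0                # third inputs of the first row are tied to "0"
    for p, sub in zip(p_rows, subs):
        total = s ^ p ^ c_in
        carry = (s & p) | (s & c_in) | (p & c_in)
        # Multiplexers: pass the carry to the next column, or replace it
        # with sub_i where the next column starts a new element.
        c_in = ((carry & ~h) << 1) & mask
        if sub:
            c_in |= lsb
        s = total
    return s, c_in


H = 0b10000100                    # elements of 3 and 5 bits, N = 8
A, B = carry_save_sum(0b11001011, [0b11110001, 0b11000100], [1, 0], H, 8)
```

In this example the low 3-bit fields of A and B add up to 1 modulo 8 and the high 5-bit fields add up to 16 modulo 32, which matches X plus both rows plus the single `sub` correction in each element.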



[0164] Having grouped the partial products that refer to separate elements of vector Y, the last expression may be
presented in the following form:

$$A + B = X + \sum_{k=1}^{K} \sum_{j=1}^{N'_k/2} Y'_{I_{k-1}+j} \times Z'_{I_{k-1}+j} = X + \sum_{k=1}^{K} \sum_{j=1}^{N'_k/2} Y'_{I_{k-1}+j} \times 2^{2(j-1)} \times Z_k.$$

[0165] Taking into account the fact that every k-th element of vector Y is equal to

$$Y_k = \sum_{j=1}^{N'_k/2} Y'_{I_{k-1}+j} \times 2^{2(j-1)},$$

the previous expression is transformed as follows:

$$A + B = X + \sum_{k=1}^{K} Y_k \times Z_k.$$

[0166] Thus, the partial product of the operation

$$X + Y \times Z$$

is generated at the outputs 28 and 29 of said unit.

[0167] The calculation unit is oriented to package processing of data vectors, whereby the set of input operand vectors
applied sequentially to each of the inputs 19 and 20 of said unit is split into successively processed subsets
(packages). The set of input operand vectors applied to each of the inputs 19 and 20 of said unit and included in the
τ-th package can be presented in the form of a vector of data vectors:

$$X^{\tau} = \begin{pmatrix} X^{\tau}_1 \\ X^{\tau}_2 \\ \vdots \\ X^{\tau}_{T_\tau} \end{pmatrix}, \qquad Y^{\tau} = \begin{pmatrix} Y^{\tau}_1 \\ Y^{\tau}_2 \\ \vdots \\ Y^{\tau}_{T_\tau} \end{pmatrix},$$

where $T_\tau$ is the number of vectors included in the τ-th package. All vectors in one package should have the same
format, i.e. the information applied to control inputs 22 and 23 of said unit should remain unchanged while one package
is processed.

[0168] Processing of the τ-th packages $X^{\tau}$ and $Y^{\tau}$ is executed over $T_\tau$ clock cycles. At the t-th
clock cycle the calculation unit executes the operation

$$A^{\tau}_t + B^{\tau}_t = X^{\tau}_t + Y^{\tau}_t \times Z^{\tau} \quad (t = 1, 2, \ldots, T_\tau),$$

where $Z^{\tau}$ is the contents of the second memory block of said unit, which should remain unchanged while the τ-th
packages $X^{\tau}$ and $Y^{\tau}$ are processed.

[0169] Simultaneously with the processing of the τ-th vector package, the procedure described above of successively
loading vectors $Z^{\tau+1}_1, Z^{\tau+1}_2, \ldots, Z^{\tau+1}_K$ from inputs 21 of said unit to the first memory block
of said unit is executed. This procedure occupies N/2 clock cycles.

[0170] When both of the mentioned processes are complete, an active signal initiating the move of matrix $Z^{\tau+1}$
from the first to the second memory block of said unit is applied to the neural processor control input 27. Said move is
executed in one clock cycle. After that said unit will process the (τ+1)-th packages of vectors $X^{\tau+1}$ and
$Y^{\tau+1}$ and will load matrix $Z^{\tau+2}$.

[0171] The number of vectors $T_\tau$ in each τ-th package may be set in program mode. It is not expedient to use
packages of vectors with $T_\tau$ less than N/2+2, because in this case the neural processor computing facilities are
not used efficiently.

[0172] The adder circuit, which block diagram is presented in Fig.9, has inputs of bits of first summand vector 31 
and of second summand vector 32, inputs of data boundaries setting for summand vectors and sum vectors 33 and out- 
puts of bits of sum vector 34. Each of N bits 94 of the adder circuit comprises a half-adder 95, an EXCLUSIVE OR gate 
96, first 97 and second 98 AND gates with inverted input. Also the adder circuit includes a carry look-ahead circuit 99. 
[0173] Inputs of bits of first summand vector 31 of the adder circuit and inputs of bits of second summand vector 32
of the adder circuit are connected respectively to first and second inputs of the half-adders 95 of bits 94 of the adder 
circuit. Inverted inputs of first 97 and second 98 AND gates with inverted input of each bit 94 of the adder circuit are 
coupled and connected to respective input of data boundaries setting for summand vectors and sum vectors 33 of the 
adder circuit. Outputs of the EXCLUSIVE OR gates 96 of bits 94 of the adder circuit are outputs of bits of sum vector 
34 of the adder circuit. Output of the first AND gate with inverted input 97 of each bit 94 of the adder circuit is connected 
to carry propagation input through the respective bit of the carry look-ahead circuit 99, which carry generation input in 
each bit is connected to output of the second AND gate with inverted input 98 of the respective bit 94 of the adder cir- 
cuit. Second input of the EXCLUSIVE OR gate 96 of q-th bit 94 of the adder circuit is connected to output of the carry 
to q-th bit of the carry look-ahead circuit 99 (where q = 2, 3,..., N), which initial carry input and second input of the 
EXCLUSIVE OR gate 96 of the first bit 94 of the adder circuit are connected to "0". In each bit 94 of the adder circuit 
sum output of the half-adder 95 is connected to first input of the EXCLUSIVE OR gate 96 and to non inverted input of 
the first AND gate with inverted input 97, and carry output of the half-adder 95 is connected to non inverted input of the 
second AND gate with inverted input 98. 
[0174] The adder circuit operates as follows. 
[0175] Bits of the first summand vector

$$A = (A_1\ A_2\ \ldots\ A_M)$$

are applied to the inputs 31 of the adder circuit. Vector A is an N-bit word of M packed data in two's complement
representation, which are the elements of this vector. The least significant bits of vector A are the bits of the first
datum $A_1$, followed by the bits of the second datum $A_2$, and so on. The most significant bits of vector A are the
bits of the M-th datum $A_M$. With such packing the v-th bit of the m-th datum $A_m$ is the

$$\left(v + \sum_{\mu=1}^{m-1} N_\mu\right)\text{-th}$$

bit of vector A, where $N_m$ is the word length of the m-th datum $A_m$ of vector A, v = 1, 2, ..., $N_m$, and
m = 1, 2, ..., M. The number of data M in vector A and the number of bits $N_m$ in each m-th datum $A_m$ of this vector
may be any integer value from 1 to N (m = 1, 2, ..., M). The only restriction is that the total word length of all the
data packed in one vector A should be equal to its word length:

$$\sum_{m=1}^{M} N_m = N.$$

[0176] Bits of the second summand vector B are applied to the inputs 32 of the adder circuit, and this vector has the
same format as that of vector A.
[0177] The adder circuit hardware is tuned to process vectors of the required formats by loading the N-bit control word
H to its inputs 33. The value 1 of the n-th bit $h_n$ of the word H means that the adder circuit will regard the n-th bit
of each of the vectors A and B as the most significant bit of the corresponding element of this vector. The number of
bits with the value 1 in the word H is equal to the number of elements in each of the vectors A and B (hereafter
n = 1, 2, ..., N):

$$\sum_{n=1}^{N} h_n = M.$$

[0178] In the n-th bit 94 of said circuit the n-th bit $a_n$ of vector A and the n-th bit $b_n$ of vector B are applied
to the inputs of the half-adder 95. At the sum and carry outputs of this half-adder 95 the auxiliary signals of carry
propagation $p_n$ and carry generation $g_n$ are generated for this bit of the adder circuit:

$$p_n = a_n \oplus b_n, \quad g_n = a_n \wedge b_n.$$

[0179] Signals $p_n$ and $g_n$ are supplied to the non-inverted inputs of respectively the first 97 and second 98 AND
gates with inverted input, to whose inverted inputs the n-th bit $h_n$ of control word H is applied. If the n-th bits
$a_n$ and $b_n$ of vectors A and B are not sign bits of the separate elements composing these vectors, then $h_n=0$ and
signals $p_n$ and $g_n$ pass to the outputs of the AND gates with inverted input 97 and 98 without changes. If the n-th
bits $a_n$ and $b_n$ of vectors A and B are sign bits of their elements, then $h_n=1$ and signals with the value 0 are
set at the outputs of the AND gates with inverted input 97 and 98. Thus the AND gates with inverted input 97 and 98 are
used to block the signals of carry propagation and of carry generation in those bits 94 of said circuit which process
the most significant bits of separate elements of input vectors A and B.

[0180] Signals from the outputs of the AND gates with inverted input 97 and 98 are applied to the carry propagation and
carry generation inputs of circuit 99, which is used to accelerate the generation of carry signals to separate bits of
the adder circuit. Any known sequential, group or look-ahead carry generation circuit applied in usual two-operand
adders may be used as the circuit 99. At the outputs of the circuit 99 the signals of the carry to separate bits of the
adder circuit are generated in accordance with the following expression: $c_{n+1} = g_n \vee p_n \wedge c_n$. So if
$h_n=1$, then the gated signals are $p_n = g_n = 0$ and the circuit 99 will generate the signal $c_{n+1}=0$.

[0181] Carry signals generated by the circuit 99 are applied to the inputs of the EXCLUSIVE OR gates 96 of the
respective bits 94 of the adder circuit, to whose other inputs the signals of carry propagation are applied from the sum
outputs of the half-adders 95. At the output of the EXCLUSIVE OR gate 96 of each n-th bit 94 of the adder circuit the
signal $s_n = p_n \oplus c_n$ is generated. Thus, at the outputs 34 of the adder circuit a vector

$$S = (S_1\ S_2\ \ldots\ S_M)$$

is generated, each element of which is equal to the sum of the respective elements of vectors A and B:

$$S_m = A_m + B_m \quad (m = 1, 2, \ldots, M).$$

Vector S will have the same format as that of vectors A and B.
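The behaviour of the adder circuit described in paragraphs [0178]-[0181] can be modelled bit by bit. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the carry circuit 99 is modelled as a simple sequential (ripple) carry chain, which the text permits, and bit 1 of each word is modelled as bit 0 of a Python integer.

```python
def packed_add(a, b, h, n_bits):
    """Add two packed vectors; carries never cross a bit n where h_n = 1."""
    s, carry = 0, 0                   # initial carry input is tied to "0"
    for n in range(n_bits):
        a_n = (a >> n) & 1
        b_n = (b >> n) & 1
        h_n = (h >> n) & 1
        p = a_n ^ b_n                 # half-adder sum output: carry propagation
        g = a_n & b_n                 # half-adder carry output: carry generation
        s |= (p ^ carry) << n         # EXCLUSIVE OR gate 96 uses the ungated p
        # AND gates 97 and 98 block propagation and generation at element MSBs.
        p_gated = p & (1 - h_n)
        g_gated = g & (1 - h_n)
        carry = g_gated | (p_gated & carry)
    return s


H = 0b10000100                        # elements of 3 and 5 bits, N = 8
total = packed_add(203, 162, H, 8)    # (3, -7) + (2, -12) in packed form
```

Here the low field holds 3 + 2 = 5 and the high field holds -19 modulo 32, i.e. 13, so each element is summed independently within its own word length.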

5 INDUSTRIAL APPLICABILITY 

[0182] Achievable technical result of the invention lies in increasing of the neural processor performance by means 
of the ability to change in program mode as word lengths of input operands, as word lengths of results. 
[0183] The peculiarity of the offered neural processor is that the user can set the following neural network parame- 
w ters in program mode: the number of layers, the number of neurons and neural inputs in each layer, the word length of 
data at each neural input the word length of each weight coefficient, the word length of output value of each neuron, the 
saturation function parameter for each neuron. 

[0184] One neural processor can emulate a neural network of a practically unlimited size. A neural network is emu- 
lated per layer (sequentially layer-by-layer). 
[0185] Every neural network layer is divided into sequentially processed fragments. Each fragment is executed in
one clock cycle. As the word length of input data and of weight coefficients is reduced, a larger fragment of the neural network
is executed per clock cycle. Several neural processors may be used to emulate one neural network, which allows
the duration of the emulation process to be reduced several times.

[0186] The achievable technical result may be intensified by decreasing the clock cycle time through inserting
input data registers in each saturation unit, in the calculation unit and in the adder circuit. These registers operate
as pipeline registers, which allow the neural processor clock cycle time to be decreased practically by three times.
[0187] The neural processor executive units are the saturation units, the calculation unit and the adder circuit. Each
executive unit executes operations over vectors of programmable word length data. In addition, these executive units may be
used both in the offered neural processor and in other vector data processing units.
[0188] The achievable technical result of the invention lies in increasing the saturation unit performance by means of
the ability to process a vector of input operands with programmable word length at a time. The saturation unit uses carry
look-ahead and carry propagation circuits, and as a result the propagation delay of said unit is approximately
equal to the propagation delay of a usual two-operand adder.

[0189] The achievable technical result of the invention lies in the expansion of the calculation unit functionality. Said unit
may execute the multiplication of a matrix of data by a vector of programmable word length data. This operation is executed
in one clock cycle, the period of which is equal to the propagation delay of a usual two-operand array multiplier.
[0190] The achievable technical result of the invention lies in increasing the adder circuit performance by means of
including arithmetic operations over vectors of programmable word length data into its operation set. In contradistinction
to known data vector adders, in the offered adder circuit the blocking of carry signals between those bits of the adder
circuit that process neighboring operands of the input vectors is implemented at the level of forming the auxiliary functions
of carry generation and carry propagation. This allows the adder circuit to use the carry propagation circuits applied in
usual two-operand adders. So the offered adder circuit, intended for the summation of vectors of programmable word
length data, has practically the same propagation delay as two-operand adders.
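
The point that an ordinary carry circuit can be reused once carry generation and carry propagation are blocked at operand boundaries can be illustrated as follows. This is a hedged Python sketch, not the offered circuit. It assumes operands packed into N-bit integers with the least significant bit first, and a boundary word h with a 1 at the most significant bit of each element. The carry network chosen here is a Kogge-Stone style parallel prefix, picked only as one example of a carry scheme used in two-operand adders. The name vector_add_prefix is hypothetical.

```python
def vector_add_prefix(a: int, b: int, h: int, n_bits: int) -> int:
    """Packed-vector addition using an unmodified parallel-prefix carry network."""
    mask = (1 << n_bits) - 1
    keep = ~h & mask            # 0 at the top bit of every element
    half_sum = (a ^ b) & mask   # half-adder sum outputs
    g = (a & b) & keep          # carry generation, blocked at boundaries
    p = half_sum & keep         # carry propagation, blocked at boundaries
    # Standard Kogge-Stone prefix steps; nothing here knows about word lengths.
    d = 1
    while d < n_bits:
        g |= p & (g << d) & mask
        p &= (p << d) & mask
        d <<= 1
    carries = (g << 1) & mask   # carry into each bit; initial carry is "0"
    return half_sum ^ carries


print(hex(vector_add_prefix(0x9F, 0x11, 0x88, 8)))  # 0xa0 (two 4-bit elements)
print(hex(vector_add_prefix(0x9F, 0x11, 0x80, 8)))  # 0xb0 (one 8-bit element)
```

Only the two masking lines depend on the boundary word. The prefix loop is the same one a conventional adder would use, which is consistent with the statement that the propagation delay stays practically unchanged.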

[0191] The offered neural processor can be efficiently used for calculation of recursive and non-recursive convolutions,
for execution of the Hadamard Transform, Fast and Discrete Fourier Transforms, and also for execution of other digital
signal processing algorithms.

[0192] The neural processor may be implemented as an independent microcircuit or as a co-processor in computer 
systems. 

Claims

1. A neural processor, comprising first, second and third registers, a first FIFO and a multiplexer, which first data input
of every bit is connected to the output of the respective bit of the first register, the data input of every bit of the second
register is connected to the respective bit of the first input bus of the neural processor, control inputs of first,
second and third registers are respective control inputs of the neural processor, characterized in that it incorporates
fourth, fifth and sixth registers, a shift register, an AND gate, a second FIFO, a switch from 3 to 2, two saturation
units, an adder circuit and a calculation unit, comprising inputs of first operand vector bits, inputs of second
operand vector bits, inputs of third operand vector bits, inputs of data boundaries setting for first operand vectors
and result vectors, inputs of data boundaries setting for second operand vectors, inputs of data boundaries setting
for third operand vectors, first and second inputs of load control of third operand vectors into the first memory block,
input of reload control of third operand matrix from the first memory block to the second memory block and outputs
of bits of first and second summand vectors of results of the addition of first operand vector and product of the
multiplication of second operand vector by third operand matrix, stored into the second memory block, and first data
inputs of bits of the switch from 3 to 2, data inputs of the first FIFO, of first, third and fourth registers and parallel
data inputs of the shift register are bit-by-bit coupled and connected to respective bits of first input bus of the neural
processor, which each bit of second input bus is connected to second data input of the respective bit of the switch
from 3 to 2, which first output of each bit is connected to input of the respective bit of input operand vector of the
first saturation unit, which control input of every bit is connected to output of the corresponding bit of the second
register, second output of each bit of the switch from 3 to 2 is connected to input of the respective bit of input operand
vector of the second saturation unit, which control input of each bit is connected to output of respective bit of
the third register, output of each bit of result vector of the first saturation unit is connected to second data input of
the respective bit of the multiplexer, which output of each bit is connected to input of the respective bit of first operand
vector of the calculation unit, which input of each bit of second operand vector is connected to output of the
respective bit of result vector of the second saturation unit, data outputs of the first FIFO are connected to inputs of
the respective bits of third operand vector of the calculation unit, which output of each bit of first summand vector
of results of the addition of first operand vector and product of the multiplication of second operand vector by third
operand matrix, stored into the second memory block, is connected to input of respective bit of first summand vector
of the adder circuit, which input of each bit of second summand vector is connected to output of respective bit
of second summand vector of results of the addition of first operand vector and product of the multiplication of second
operand vector by third operand matrix, stored into the second memory block, a calculation unit, which each
input of data boundaries setting for first operand vectors and result vectors is connected to output of the respective
bit of the fifth register and to the respective input of data boundaries setting for summand vectors and sum vectors
of the adder circuit, which output of each bit of sum vector is connected to respective data input of the second FIFO,
which each data output is connected to the respective bit of output bus of the neural processor and to third input of
the respective bit of the switch from 3 to 2, output of each bit of the fourth register is connected to data input of the
respective bit of the fifth register and to the respective input of data boundaries setting for third operand vectors of
the calculation unit, which each input of data boundaries setting for second operand vectors is connected to output
of the respective bit of the sixth register, which data input of each bit is connected to output of the respective bit of
the shift register, which sequential data input and output are coupled and connected to first input of load control of
third operand vectors into the first memory block of the calculation unit and to first input of the AND gate, which output
is connected to read control input of the first FIFO, second input of the AND gate, shift control input of the shift
register and second input of load control of third operand vectors into the first memory block of the calculation unit
are coupled and connected to respective control input of the neural processor, input of reload control of third operand
matrix from the first memory block to the second memory block of the calculation unit and control inputs of fifth
and sixth registers are coupled and connected to the respective control input of the neural processor, control inputs
of the switch from 3 to 2, of the multiplexer and of the fourth register, write control inputs of the shift register and of
the first FIFO and read and write control inputs of the second FIFO are respective control inputs of the neural
processor, state outputs of first and second FIFOs are state outputs of the neural processor.

2. The neural processor as recited in claim 1, characterized in that the calculation unit comprises a shift register,
performing the arithmetic shift of J bits left on all N-bit vector operands stored in it, where J is the minimal value that is an
aliquot part of data word lengths in second operand vectors of the calculation unit, a delay element, a first memory
block, containing sequential input port and N/J cells to store N-bit data, a second memory block, containing N/J
cells to store N-bit data, N/J multiplier blocks, each of which multiplies an N-bit vector of programmable word length data
by a J-bit multiplier, and a vector adding circuit, generating partial product of the summation of N/J + 1 programmable
word length data vectors, and inputs of third operand vector bits of the calculation unit are connected to data inputs
of the shift register, which outputs are connected to data inputs of the first memory block, which outputs of each cell
are connected to data inputs of the respective cell of the second memory block, which outputs of each cell are connected
to inputs of multiplicand vector bits of the respective multiplier block, which inputs of the multiplier bits are
connected to inputs of the respective J-bit group of second operand vector bits of the calculation unit, outputs of
each multiplier block are connected to inputs of bits of the respective summand vector of the vector adding circuit,
which inputs of (N/J + 1)-th summand vector bits are connected to inputs of first operand vector bits of the calculation
unit, which inputs of data boundaries setting for third operand vectors are connected to respective inputs of
data boundaries setting for operand vectors of the shift register, which mode select input is connected to first input
of load control of third operand vectors into the first memory block of the calculation unit, which second input of load
control of third operand vectors into the first memory block is connected to clock input of the shift register and to
input of the delay element, which output is connected to write control input of the first memory block, write control
input of the second memory block is connected to input of reload control of third operand matrix from the first memory
block to the second memory block of the calculation unit, which every input of data boundaries setting for second
operand vectors is connected to input of the sign correction of the respective multiplier block, inputs of data
boundaries setting for first operand vectors and for result vectors of the calculation unit are connected to inputs of
data boundaries setting for multiplicand vectors and for result vectors of each multiplier block and to inputs of data
boundaries setting for summand vectors and result vectors of the vector adding circuit, which outputs of bits of first
and second summand vectors of results are respective outputs of the calculation unit.

3. The neural processor as recited in claim 1, characterized in that each saturation unit comprises an input data register,
which data inputs are inputs of respective bits of input operand vector of said unit, the calculation unit comprises
an input data register, which data inputs are inputs of respective bits of first and second operand vectors of
said unit, the adder circuit comprises an input data register, which data inputs are inputs of respective inputs of the
adder circuit.

4. A saturation unit, comprising in each of N bits a first multiplexer, which second data input is connected to input of
the respective bit of input operand vector of said unit, which output of each bit of result vector is connected to output
of the first multiplexer of the respective bit of said unit, characterized in that it incorporates a carry propagation circuit
and a carry look-ahead circuit, and its each bit incorporates a second multiplexer and an EXCLUSIVE OR gate,
an EQUIVALENCE gate, a NAND gate and an AND gate with inverted input, and non inverted input of the AND gate
with inverted input and first inputs of the NAND gate and the EQUIVALENCE gate of each bit of said unit are coupled
and connected to the respective control input of said unit, output of the NAND gate of n-th bit of said unit is
connected to input of carry propagation through (N-n + 1)-th bit of the carry look-ahead circuit, which output of the
carry to (N-n + 2)-th bit is connected to control input of the first multiplexer of n-th bit of said unit, output of the AND
gate with inverted input of which is connected to control input of the second multiplexer of the same bit of said unit,
to carry generation input of (N-n + 1)-th bit of the carry look-ahead circuit and to inverted input of the carry propagation
through (N-n + 1)-th bit of the carry propagation circuit, which carry input from (N-n + 1)-th bit is connected
to output of the second multiplexer of n-th bit of said unit (where n = 1, 2,..., N), first input of the EXCLUSIVE OR
gate and non inverted input of the AND gate with inverted input of q-th bit of said unit are respectively connected
to second input of the EXCLUSIVE OR gate and to inverted input of the AND gate with inverted input of (q-1)-th bit
of said unit, first data input of the second multiplexer of which is connected to output of the carry to (N-q + 2)-th bit
of the carry propagation circuit (where q = 2, 3,..., N), initial carry inputs of the carry propagation circuit and of the
carry look-ahead circuit, second input of the EXCLUSIVE OR gate, inverted input of the AND gate with inverted
input and first data input of the second multiplexer of N-th bit of said unit are coupled and connected to "0", and in
each bit of said unit output of the second multiplexer is connected to second input of the EQUIVALENCE gate,
which output is connected to first data input of the first multiplexer, which second data input is connected to second
data input of the second multiplexer and to first input of the EXCLUSIVE OR gate, which output is connected to second
input of the NAND gate of the same bit of said unit.

5. The saturation unit as recited in claim 4, characterized in that output of the carry to q-th bit is connected to carry
input from (q-1)-th bit in the carry propagation circuit (where q = 1, 2,..., N).

6. The saturation unit as recited in claim 4, characterized in that the carry look-ahead circuit comprises AND gates
and OR gates of quantity of N both, and each input of the carry propagation through the respective bit of said circuit
is connected to first input of the respective AND gate, which output is connected to first input of the respective OR
gate, which second input and output are respectively connected to carry generation input of the respective bit of
said circuit and to output of the carry to the same bit of said circuit, second input of the first AND gate is initial carry
input of said circuit, second input of q-th AND gate is connected to output of (q-1)-th OR gate (where q = 2, 3,..., N).

7. A calculation unit, comprising N/2 decoders of multiplier bits and a multiplier array of N columns by N/2 cells, each
of them consists of a one-bit partial product generation circuit and a one-bit adder, and respective control inputs
of the one-bit partial product generation circuits of i-th cells of all columns of the multiplier array are coupled and
connected to respective outputs of i-th decoder of multiplier bits (where i = 1, 2,..., N/2), first input of the one-bit
adder of each cell of the multiplier array is connected to output of the one-bit partial product generation circuit of
the same cell of the multiplier array, characterized in that it incorporates N/2 AND gates with inverted input, a delay
element and an N-bit shift register, which each bit consists of an AND gate with inverted inputs, a multiplexer and a
trigger, and each cell of the multiplier array incorporates first and second triggers, functioned as memory cells of
respectively first and second memory blocks of said unit, an AND gate with inverted input and a multiplexer, and
input of each bit of first operand vector of said unit is connected to second input of the one-bit adder of the first cell
of the respective column of the multiplier array, control inputs of multiplexers and inverted inputs of the AND gates
with inverted input of all cells of each column of which are coupled and connected to respective input of data boundaries
setting for first operand vectors and for result vectors of said unit, which each input of data boundaries setting
for second operand vectors is connected to inverted input of the respective AND gate with inverted input, which output
is connected to first input of the respective decoder of multiplier bits, second and third inputs of i-th decoder of
multiplier bits are connected to inputs of respectively (2i-1)-th and (2i)-th bits of second operand vector of said unit
(where i = 1, 2,..., N/2), non inverted input of j-th AND gate with inverted input is connected to third input of (j-1)-th
decoder of multiplier bits (where j = 2, 3,..., N/2), input of each bit of third operand vector of said unit is connected
to second data input of the multiplexer of the respective bit of the shift register, which first data input is connected
to output of the AND gate with inverted inputs of the same bit of the shift register, which first inverted input is connected
to respective input of data boundaries setting for third operand vectors of said unit, second inverted input of
the AND gate with inverted inputs of q-th bit of the shift register is connected to first inverted input of the AND gate
with inverted inputs of (q-1)-th bit of the shift register (where q = 2, 3,..., N), non inverted input of AND gate with
inverted inputs of r-th bit of the shift register is connected to trigger output of (r-2)-th bit of the shift register (where
r = 3, 4,..., N), control inputs of multiplexers of all shift register bits are coupled and connected to first input of load
control of third operand vectors into the first memory block of said unit, clock inputs of triggers of all shift register
bits and input of the delay element are coupled and connected to second input of load control of third operand vectors
into the first memory block, output of the multiplexer of each shift register bit is connected to data input of the
trigger of the same bit of the shift register, which output is connected to data input of the first trigger of the last cell
of the respective column of the multiplier array, output of the first trigger of j-th cell of each multiplier array column
is connected to data input of the first trigger of (j-1)-th cell of the same multiplier array column (where j = 2, 3,..., N/2),
clock inputs of the first triggers of all multiplier array cells are coupled and connected to output of the delay element,
clock inputs of the second triggers of all multiplier array cells are coupled and connected to input of reload control
of third operand matrix from the first memory block to the second memory block, second data input of the one-bit
partial product generation circuit of i-th cell of q-th multiplier array column is connected to output of the AND gate
with inverted input of i-th cell of (q-1)-th multiplier array column (where i = 1, 2,..., N/2 and q = 2, 3,..., N), second
input of the one-bit adder of j-th cell of each multiplier array column is connected to sum output of the one-bit adder
of the (j-1)-th cell of the same multiplier array column (where j = 2, 3,..., N/2), third input of the one-bit adder of j-th
cell of q-th multiplier array column is connected to output of the multiplexer of (j-1)-th cell of (q-1)-th multiplier array
column (where j = 2, 3,..., N/2 and q = 2, 3,..., N), third input of the one-bit adder of j-th cell of the first multiplier array
column is connected to third output of (j-1)-th decoder of multiplier bits (where j = 2, 3,..., N/2), sum output of the
one-bit adder of the last cell of each multiplier array column is output of the respective bit of first summand vector
of results of said unit, output of the multiplexer of the last cell of (q-1)-th multiplier array column is output of q-th bit
of second summand vector of results of said unit (where q = 2, 3,..., N), which first bit of second summand vector
of results is connected to third output of (N/2)-th decoder of multiplier bits, second inverted and non inverted inputs
of the AND gate with inverted inputs of the first bit and non inverted input of the AND gate with inverted inputs of
the second bit of the shift register, second data inputs of the one-bit partial product generation circuits of all cells of
the first column of the multiplier array, third inputs of one-bit adders of first cells of all multiplier array columns and
non inverted input of the first AND gate with inverted input are coupled and connected to "0", and in each multiplier
array cell the output of the first trigger is connected to data input of the second trigger, which output is connected
to non inverted input of the AND gate with inverted input and to first data input of the one-bit partial product generation
circuit, which third control input is connected to second data input of the multiplexer, which first data input is
connected to carry output of the one-bit adder of the same cell of the multiplier array.

8. An adder circuit, comprising a carry look-ahead circuit, a half-adder and an EXCLUSIVE OR gate in each of its N bits,
and input of each bit of first summand vector of the adder circuit and input of respective bit of second summand
vector of the adder circuit are connected respectively to first and second inputs of the half-adder of respective bit
of the adder circuit, which sum output is connected to first input of the EXCLUSIVE OR gate of the same bit of the
adder circuit, which output is output of the respective bit of sum vector of the adder circuit, second input of the
EXCLUSIVE OR gate of q-th bit of the adder circuit is connected to output of the carry to q-th bit of the carry
look-ahead circuit (where q = 2, 3,..., N), which initial carry input and second input of the EXCLUSIVE OR gate of the
first bit of the adder circuit are connected to "0", characterized in that first and second AND gates with inverted
input are incorporated in each its bit, and sum output of the half-adder of each bit of the adder circuit is connected
to non inverted input of the first AND gate with inverted input of the same bit of the adder circuit, which output is
connected to carry propagation input through the respective bit of the carry look-ahead circuit, carry output of the
half-adder of each bit of the adder circuit is connected to non inverted input of the second AND gate with inverted
input of the same bit of the adder circuit, which output is connected to carry generation input of the respective bit
of the carry look-ahead circuit, inverted inputs of first and second AND gates with inverted input of each bit of the
adder circuit are coupled and connected to respective input of data boundaries setting for summand vectors and
sum vectors.